Abstract

We  focus  on  decomposition  of  fault-tolerant  real-time  programs  that  are  designed  from  their  fault-intolerant  versions. 
Towards  this  end,  motivated  by  the  concepts  of  state  predicate  detection  and  state  predicate  correction  [1]  for  untimed  systems, 
we identify three types of components, namely, detectors, weak $\delta$-correctors, and strong $\delta$-correctors. We also consider
different levels of fault-tolerance, namely, soft-failsafe, hard-failsafe, nonmasking, soft-masking, and hard-masking, depending
upon the satisfaction of safety, liveness, and timing constraints in the presence of faults. We show that depending upon the level
of tolerance, fault-tolerant real-time programs contain one or more detectors and/or weak/strong $\delta$-correctors.


Keywords: Fault-tolerance, Real-time, Component-based design, Decomposition, Bounded-time recovery,
Formal methods.


1  Introduction 

We analyze real-time fault-tolerant programs that are designed from their fault-intolerant versions. Such fault-tolerant programs
may be designed for maintenance to deal with a previously unanticipated class of faults or for separating functionality of the
program from its tolerance properties. In both these cases, “reuse” of existing program is desirable, and possibly mandatory.
An important concern for reuse-based techniques for design of fault-tolerance is their completeness. Intuitively, completeness
of a reuse-based technique captures the ability of that technique to produce any fault-tolerant program from a fault-intolerant
program, say p, assuming that there is some reuse-based design of a fault-tolerant version of p. Said another way, if p can
be made fault-tolerant by any reuse design, then the technique should be able to yield one such fault-tolerant program. Our
focus in this paper is on the completeness of reuse-based techniques in the context of component-based design of fault-tolerant
real-time programs.

*This work was partially sponsored by NSF CAREER CCR-0092724, DARPA Grant OSURS01-C-1901, ONR Grant N00014-01-1-0744, NSF grant
EIA-0130724, and a grant from Michigan State University.

Regarding completeness, there are two main issues in such component-based design: (1) Design method: where one focuses
on transforming the given program into a fault-tolerant program, and (2) Containment question: where we want to determine
whether such components exist in fault-tolerant programs irrespective of how they are designed. Regarding the first issue,
previously, in [1], Arora and Kulkarni have presented a sound and complete method for component-based design of fault-tolerant
untimed programs. Their method is based on the principle of state detection and state correction. In particular, using
this  principle,  they  identify  two  components,  namely,  detectors  and  correctors  that  respectively  focus  on  detecting  whether  the 
execution  of  a  given  program  action  is  safe  in  the  given  state  and  restoring  the  execution  of  a  program  to  a  state  where  a  certain 
state predicate is true. They subsequently show that (1) detectors are necessary and sufficient building blocks for designing
failsafe fault-tolerant programs, i.e., programs that satisfy their safety specification in the presence of faults, (2) correctors are
necessary and sufficient building blocks for designing nonmasking programs, i.e., programs that recover to legitimate states
after occurrence of faults, and (3) both are necessary and sufficient for masking programs, i.e., programs where both safety and
liveness  specification  are  satisfied  in  the  presence  of  faults. 

In  this  paper,  we  focus  on  the  second  question.  In  particular,  we  investigate  whether  the  idea  of  state  detection  and  correction 
can be applied to real-time fault-tolerant programs. Towards this end, we define three types of components, namely, detectors,
weak $\delta$-correctors, and strong $\delta$-correctors. Similar to [1], detectors are based on the concept of detecting state predicates.
However, based on the closure properties of the correction state predicate, we introduce weak and strong $\delta$-correctors.

We  illustrate  that  depending  upon  the  level  of  fault-tolerance,  existing  real-time  fault-tolerant  programs  that  reuse  their  fault- 
intolerant  version  contain  one  or  more  of  the  above  components.  We  show  the  existence  of  the  components  by  investigating 
the  necessary  conditions  under  which  fault-tolerance  can  be  provided  in  the  context  of  real-time  programs.  Towards  this  end, 
we  refine  the  levels  of  fault-tolerance  considered  in  [1]  based  on  the  satisfaction  of  safety,  liveness,  and  timing  requirements 
(e.g.,  deadlines)  in  the  presence  of  faults.  In  particular,  we  consider  soft-failsafe,  hard-failsafe,  nonmasking,  soft-masking,  and 
hard-masking fault-tolerance (cf. Section 2 for precise definitions). Intuitively, a soft fault-tolerant program is not required to
meet  its  timing  constraints  in  the  presence  of  faults.  However,  in  the  absence  of  faults  a  soft  fault-tolerant  program  behaves  like 
its  fault-intolerant  version,  i.e.,  it  satisfies  its  timing  constraints  in  the  absence  of  faults.  On  the  other  hand,  a  hard  fault-tolerant 
program  must  satisfy  its  timing  constraints  even  in  the  presence  of  faults.  In  other  words,  in  hard  fault-tolerant  programs,  the 
demand  for  hard  real-time  processing  merges  with  catastrophic  consequences  of  systems,  whereas  in  soft  fault-tolerance  the 
catastrophic  consequences  are  not  related  to  the  program’s  timing  constraints.  Furthermore,  for  nonmasking,  soft-masking,  and 
hard-masking  levels  of  fault-tolerance,  we  distinguish  two  cases  where  state  correction  is  achieved  in  bounded  or  unbounded 
amount  of  time.  In  other  words,  we  consider  unbounded  and  bounded-time  recovery  in  the  presence  of  faults. 

In  order  to  formally  express  timing  constraints  of  a  real-time  program,  we  focus  on  a  standard  property  in  real-time  systems 
called bounded response. Intuitively, a bounded response property specifies that if a state predicate, say P, becomes true then
another state predicate, say Q, must become true within a bounded amount of time, say $\delta$. Observe that according to [2, 3],
such properties fall in the category of safety properties. Thus, in this paper the notion of safety specification consists of a timed
part (i.e., a set of bounded response properties) and an untimed part (e.g., state perturbations), which is modeled by a set of bad
transitions.

Contributions of the paper. In this paper, we concentrate on the necessity of state predicate detection and state predicate
correction within a pre-specified bounded amount of time. That is, we show that every fault-tolerant real-time program that
reuses its fault-intolerant version must contain detectors or (weak/strong) $\delta$-correctors. The main results from this paper are as
follows:

•  We  precisely  define  what  it  means  for  a  fault-tolerant  real-time  program  to  contain  a  component.  Our  definition  of 
containment for (weak/strong) $\delta$-correctors is similar to that in [1]. However, for detectors, our definition of containment
is  more  rigorous  than  that  in  [1]  in  that  it  precisely  identifies  how  predicates  being  detected  are  related  to  the  fault- 
intolerant  program  and  how  the  output  of  the  detector  (called  witness  predicate)  is  being  used  by  the  fault-tolerant 
program. 

•  A nonmasking fault-tolerant program with recovery time $\theta$ is one that recovers to a state from where its subsequent
computation satisfies both (timed and untimed) safety and liveness within $\theta$ time units. However, safety may be violated
before the program reaches such a state. We show that if a program satisfies both the safety and the liveness specifications
within $\theta$, then there exists $\delta$, where $\delta$ is a function of $\theta$, such that the program contains strong $\delta$-correctors.

•  A  hard-failsafe  fault-tolerant  program  is  one  that  always  satisfies  both  untimed  and  timed  parts  of  its  safety  specification. 
Regarding  hard-failsafe  fault-tolerance,  we  show  that  hard-failsafe  fault-tolerant  programs  contain  detectors  to  satisfy 
the untimed part and weak $\delta$-correctors to satisfy the timed part of their safety specification for some $\delta$.

•  A soft-masking fault-tolerant program with recovery time $\theta$ is one that recovers to a state from where its subsequent
computation satisfies both safety and liveness within $\theta$ time units. Moreover, until such a state is reached, only the
untimed part of its safety specification is guaranteed not to be violated. We show that soft-masking fault-tolerant programs contain detectors
and strong $\delta$-correctors for some $\delta$, where $\delta$ is a function of $\theta$.

•  A hard-masking fault-tolerant program with recovery time $\theta$ is a soft-masking program with recovery time $\theta$ that satisfies
both untimed and timed parts of its safety specification during recovery. We show that hard-masking fault-tolerant
programs contain detectors, weak $\delta_1$-correctors, and strong $\delta_2$-correctors for some $\delta_1$ and $\delta_2$, where $\delta_2$ is a function of $\theta$.

Organization of the paper. The rest of the paper is organized as follows: In Section 2, we formally define the notions of real-time
programs, faults, and fault-tolerance. Then, in Section 3, we present the formal definition of fault-tolerance components
for real-time programs, namely, detectors and weak/strong $\delta$-correctors. In Section 4, we develop the theory of the components
and  we  show  the  necessity  of  their  existence  in  real-time  programs  with  respect  to  different  levels  of  fault-tolerance.  Finally, 
in  Section  5,  we  make  concluding  remarks  and  discuss  future  work. 

2  Real-Time  Programs,  Specifications,  and  Fault-Tolerance 

In  this  section,  we  give  formal  definitions  of  real-time  programs,  specifications,  faults,  and  fault-tolerance.  The  definition  of 
real-time programs is adapted from Alur and Henzinger [4] and Alur and Dill [5]. The definition of specifications is based on the
work by Alpern and Schneider [2] and Henzinger [3]. Finally, the notion of faults and fault-tolerance is due to Arora and
Gouda [6], and Bonakdarpour and Kulkarni [7].

2.1  Real-Time  Programs


Let $V$ be a finite set of discrete variables and $W$ be a finite set of clock variables. Each discrete variable is associated with a
finite domain $D$ of values. A location is a function that maps each discrete variable to a value from its respective domain. A
clock constraint over the set $W$ of clock variables is a Boolean combination of formulas of the form $x \preceq c$ or $x - y \preceq c$, where
$x, y \in W$, $c \in \mathbb{Z}_{\geq 0}$, and $\preceq$ is either $<$ or $\leq$. We denote the set of all clock constraints over $W$ by $\Phi(W)$. A clock valuation is
a function $\nu : W \rightarrow \mathbb{R}_{\geq 0}$ that assigns a real value to each clock variable. Furthermore, for $\tau \in \mathbb{R}_{\geq 0}$, $\nu + \tau$ denotes the
clock valuation with $(\nu + \tau)(x) = \nu(x) + \tau$ for every clock $x$. Also, for $\lambda \subseteq W$, $\nu[\lambda := 0]$ denotes the clock valuation for $W$
which assigns 0 to each $x \in \lambda$ and agrees with $\nu$ over the rest of the clock variables in $W$.
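The two operations on clock valuations can be illustrated with a minimal Python sketch. It assumes a valuation is stored as a dictionary from clock names to non-negative floats; the function names `advance` and `reset` are hypothetical and are not part of the formal model.

```python
from typing import Dict, Iterable

Valuation = Dict[str, float]  # clock name -> non-negative real value (assumed encoding)


def advance(v: Valuation, tau: float) -> Valuation:
    """Return v + tau: every clock advances by the same amount tau >= 0."""
    if tau < 0:
        raise ValueError("tau must be non-negative")
    return {x: val + tau for x, val in v.items()}


def reset(v: Valuation, lam: Iterable[str]) -> Valuation:
    """Return v[lam := 0]: clocks in lam become 0, all others agree with v."""
    lam = set(lam)
    return {x: (0.0 if x in lam else val) for x, val in v.items()}


v = {"x": 1.5, "y": 0.0}
v1 = advance(v, 2.0)   # both clocks advance by 2.0
v2 = reset(v1, ["x"])  # only clock x is reset
```

Both functions return new dictionaries, so the original valuation `v` is left unchanged.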

A state (denoted $\sigma$) is a pair $(s, \nu)$, such that $s$ is a location and $\nu$ is a clock valuation for $W$ at location $s$. The set of all
possible states is called the state space obtained from the associated variables.

Definition 2.1 (computations) A computation is a finite or infinite timed state sequence of the form:

$$\overline{\sigma} = (\sigma_0, \tau_0) \rightarrow (\sigma_1, \tau_1) \rightarrow \cdots$$

where $\sigma_i$ is a state in the state space obtained from the associated variables, for all $i \in \mathbb{Z}_{\geq 0}$, and the sequence $\tau_0 \tau_1 \cdots$ (called
global time) satisfies the following constraints:

•  Initialization: $\tau_0 \geq 0$,

•  Monotonicity: $\tau_i \leq \tau_{i+1}$ for all $i \in \mathbb{Z}_{\geq 0}$, and

•  Divergence: if $\overline{\sigma}$ is infinite, for all $t \in \mathbb{R}_{\geq 0}$, there exists $j$ such that $\tau_j \geq t$. ■
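As an illustration of the first two constraints, the following Python sketch checks Initialization and Monotonicity on a finite timed state sequence. The list-of-pairs encoding and the name `is_timed_prefix` are assumptions made for this example. Divergence constrains only infinite computations, so it is not checked here.

```python
from typing import Any, List, Tuple

TimedState = Tuple[Any, float]  # (state, global time); an assumed encoding


def is_timed_prefix(seq: List[TimedState]) -> bool:
    """Check Initialization and Monotonicity on a finite timed state sequence."""
    if not seq:
        return True
    times = [t for _, t in seq]
    if times[0] < 0:  # Initialization: tau_0 >= 0
        return False
    # Monotonicity: tau_i <= tau_{i+1}
    return all(a <= b for a, b in zip(times, times[1:]))


ok = is_timed_prefix([("a", 0.0), ("b", 0.0), ("c", 2.5)])
backwards = is_timed_prefix([("a", 1.0), ("b", 0.5)])
```

Equal consecutive time stamps are accepted, which matches the non-strict inequality in the Monotonicity constraint.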

Notice that in Definition 2.1, we do not specify an initial value for global time. It follows that in a computation $\overline{\sigma}$, $\tau_0$ can
be assigned any value from $\mathbb{R}_{\geq 0}$. Let $\overline{\sigma} + t$ denote the computation with time shift $t \in \mathbb{R}$, such that $\tau_0 \geq 0$ (i.e., $\tau_i$ becomes
$\tau_i + t$ for all $i \geq 0$). Also, let $S$ be any set of computations. For all $\overline{\sigma} \in S$, we require that for all time shifts $t \in \mathbb{R}$, $\overline{\sigma} + t$ be in
$S$ as well.

Definition 2.2 (suffix and fusion closure) Suffix closure of a set of computations means that if a computation $\overline{\sigma}$ is in that
set then so are all the suffixes of $\overline{\sigma}$. Fusion closure of a set of computations means that if computations $\langle \alpha, (\sigma, \tau), \gamma \rangle$ and
$\langle \beta, (\sigma, \tau), \psi \rangle$ are in that set then so are the computations $\langle \alpha, (\sigma, \tau), \psi \rangle$ and $\langle \beta, (\sigma, \tau), \gamma \rangle$, where $\alpha$ and $\beta$ are finite prefixes of
computations, $\gamma$ and $\psi$ are suffixes of computations, and $\sigma$ is a state at global time $\tau$. ■

Notation. Let $\overline{\sigma}_i$ denote the pair $(\sigma_i, \tau_i)$ in computation $\overline{\sigma}$. Also, let $\alpha$ and $\beta$ be finite computations, where the length of $\alpha$
is $n$. The concatenation of $\alpha$ and $\beta$ (denoted $\alpha\beta$) is a computation, where either clock variables (except possibly a subset that
is reset) and global time of $\alpha_{n-1}$ and $\beta_0$ are equal, but their locations are different, or their locations are equal, but the clock
variables and global time are advanced equally. If $\Gamma$ and $\Psi$ are two sets of finite computations, then
$\Gamma\Psi = \{\alpha\beta : \alpha \in \Gamma \text{ and } \beta \in \Psi\}$.

Definition 2.3 (real-time programs) A real-time program $p$ is specified by a set of discrete variables, a set of clock variables,
and a suffix closed and fusion closed set of maximal computations in the state space of $p$ (denoted $S_p$). By maximal, we mean
that if $\overline{\sigma} = \alpha\beta$, where the prefix $\alpha = (\sigma_0, \tau_0) \rightarrow (\sigma_1, \tau_1) \rightarrow \cdots (\sigma_n, \tau_n)$ and the infinite suffix
$\beta = ((s_{n+1}, \nu_{n+1}), \tau_{n+1}) \rightarrow ((s_{n+1}, \nu_{n+2}), \tau_{n+2}) \rightarrow ((s_{n+1}, \nu_{n+3}), \tau_{n+3}) \cdots$, is a computation of $p$ such that
$\nu_{j+1} = \nu_j + (\tau_{j+1} - \tau_j)$, for all $j > n$,
then there is no other computation of $p$ that has $\alpha$ as a prefix. In other words, given a finite computation prefix $\alpha$ of $p$, $p$ does
not contain the computation that stutters $\sigma_{n+1}$ infinitely if there is any other computation of $p$ that extends $\alpha$. ■

Notation. For simplicity, we use pseudo-arithmetic expressions to denote timing constraints over finite computations.
For instance, $\alpha_{\leq\delta}$, where $\delta \in \mathbb{Z}_{\geq 0}$, denotes a finite computation $(\sigma_0, \tau_0) \rightarrow (\sigma_1, \tau_1) \rightarrow \cdots (\sigma_n, \tau_n)$ that satisfies the timing
constraint $\tau_n - \tau_0 \leq \delta$.

Definition 2.4 (state predicates) A state predicate of $p$ is a subset of $S_p$. We say that a state predicate $S$ is closed in $p$ iff in
every computation, $(\sigma_0, \tau_0) \rightarrow (\sigma_1, \tau_1) \rightarrow \cdots$, of $p$, if $S$ is true in state $\sigma_j$, $j \in \mathbb{Z}_{\geq 0}$, (denoted $\sigma_j \models S$) then $S$ remains true
for all states $\sigma_k$, where $k \geq j$ (i.e., $\sigma_k \models S$). ■

Definition 2.5 ($S$-computations) Let $S$ be a state predicate and $p$ be a program. The $S$-computations of $p$, denoted as $p \mid S$,
is the set of computations of $p$ that start in a state where $S$ is true. ■

Notice  that  since  the  set  of  computations  of  a  program  is  suffix  closed  and  fusion  closed,  the  program  can  be  written  in 
terms of transitions that it can execute [1]. Hence, we can view a program $p$ as a set of transitions. To concisely write the
transitions of a program, we use timed guarded commands [4]. A timed guarded command (also called timed action) is of the
form $L :: \mathit{Guard} \xrightarrow{\lambda} \mathit{statement}$, where $L$ is a label, Guard is a state predicate, statement is a statement that describes how the
program state is updated, and $\lambda$ is a set of clock variables that are reset by execution of $L$. Thus, $L$ denotes the set of transitions
$\{(\sigma_0, \sigma_1) \mid$ Guard is true in state $\sigma_0$, $\sigma_1$ is obtained by resetting the clock variables in $\lambda$ and changing $\sigma_0$ as prescribed by
statement$\}$. A guarded wait command (also called delay action) is of the form $\mathit{Guard} \rightarrow \mathbf{wait}$, where Guard identifies the
states  where  a  delay  transition  is  allowed  to  be  taken  (i.e.,  the  program  stutters  in  a  location  and  lets  time  advance).  A  guarded 
wait  command  delays  the  program  by  an  arbitrary  amount  of  time,  as  long  as  Guard  remains  true.  We  present  examples  of 
timed  guarded  commands  in  Section  4. 
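To make the two kinds of commands concrete, here is a small Python sketch. Everything in it is an assumption for illustration: the `TimedAction` class, the `wait` helper, and the `close_gate` example are hypothetical names, and states are encoded as a pair of dictionaries (location, clock valuation). The delay action checks its guard only at the start and at the end of the delay, which is sufficient for convex guards such as a clock upper bound but not for arbitrary guards.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, Tuple

Location = Dict[str, int]    # discrete variable -> value
Clocks = Dict[str, float]    # clock variable -> value
State = Tuple[Location, Clocks]


@dataclass(frozen=True)
class TimedAction:
    """L :: Guard --lambda--> statement (hypothetical encoding)."""
    label: str
    guard: Callable[[Location, Clocks], bool]
    statement: Callable[[Location], Location]
    resets: FrozenSet[str]   # the set lambda of clocks reset by the action

    def fire(self, state: State) -> State:
        s, v = state
        if not self.guard(s, v):
            raise ValueError(f"{self.label}: guard is false")
        new_s = self.statement(dict(s))
        new_v = {x: (0.0 if x in self.resets else t) for x, t in v.items()}
        return new_s, new_v


def wait(guard: Callable[[Location, Clocks], bool], state: State, tau: float) -> State:
    """Guard -> wait: stay in the location and advance all clocks by tau."""
    s, v = state
    later = {x: t + tau for x, t in v.items()}
    # Assumption: checking both endpoints is enough (convex guard).
    if tau < 0 or not guard(s, v) or not guard(s, later):
        raise ValueError("delay not allowed by guard")
    return dict(s), later


close_gate = TimedAction(
    label="close_gate",
    guard=lambda s, v: s["gate"] == 1 and v["x"] >= 2.0,
    statement=lambda s: {**s, "gate": 0},
    resets=frozenset({"x"}),
)

start = ({"gate": 1}, {"x": 0.0, "y": 0.0})
delayed = wait(lambda s, v: v["x"] <= 3.0, start, 2.5)
after = close_gate.fire(delayed)
```

In this run, the delay advances both clocks equally, and firing `close_gate` changes the location and resets only clock `x`.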

2.2  Specifications 

Similar  to  a  program,  a  specification  (also  called  property )  is  specified  by  sets  of  discrete  and  clock  variables  (respectively, 
state  space)  and  a  suffix  closed  and  fusion  closed  set  of  (finite  or  infinite)  computations  over  the  state  space  obtained  from  those 
variables. 

In  order  to  capture  the  real-time  properties  of  programs  (e.g.,  deadlines  and  recovery  time),  in  this  paper,  we  focus  on  a 
standard  property  of  real-time  systems  called  stable  bounded  response. 

Definition 2.6 (stable bounded response) Let $P$ and $Q$ be two state predicates. A stable bounded response property (denoted
$P \mapsto_{\leq\delta} Q$) is the set of all computations $(\sigma_0, \tau_0) \rightarrow (\sigma_1, \tau_1) \rightarrow \cdots$ in which if $\sigma_i \models P$, for some $i \geq 0$, then there exists $j$,
$j \geq i$, such that (1) $\sigma_j \models Q$, (2) $\tau_j - \tau_i \leq \delta$, and (3) for all $k$, $i \leq k < j$, $\sigma_k \models P$. In other words, it is always the case that a
state in $P$ is followed by a state in $Q$ within $\delta$ time units and $P$ continuously remains true until $Q$ becomes true. We call $P$ the
event predicate, $Q$ the response predicate, and $\delta$ the response time. ■

Assumption 2.7 We assume that the set of clock variables of any stable bounded response property $P \mapsto_{\leq\delta} Q$ contains a
special clock variable, which is reset whenever $P$ becomes true. This assumption is necessary to ensure that stable bounded
response properties are fusion closed. ■
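The three conditions of Definition 2.6 can be checked directly on a finite timed state sequence, as in the following Python sketch. The function name and the encoding of computations as lists of (state, time) pairs are assumptions. The sketch treats the sequence as a complete computation, so a $P$-state near the end with no later $Q$-state is reported as a violation even if its deadline has not yet expired.

```python
from typing import Any, Callable, List, Tuple

Computation = List[Tuple[Any, float]]  # (state, global time) pairs, assumed encoding


def satisfies_stable_bounded_response(
    comp: Computation,
    P: Callable[[Any], bool],
    Q: Callable[[Any], bool],
    delta: float,
) -> bool:
    """Check P |->_{<= delta} Q on a finite computation."""
    for i, (s_i, t_i) in enumerate(comp):
        if not P(s_i):
            continue
        responded = False
        for j in range(i, len(comp)):
            s_j, t_j = comp[j]
            if t_j - t_i > delta:   # condition (2) can no longer be met
                break
            if Q(s_j):              # condition (1); P held at all k in [i, j)
                responded = True
                break
            if not P(s_j):          # condition (3) violated before Q
                break
        if not responded:
            return False
    return True


# Hypothetical gate example: once closed, the gate reopens within 5 time units.
is_closed = lambda s: s == "closed"
is_open = lambda s: s == "open"

on_time = satisfies_stable_bounded_response(
    [("open", 0.0), ("closed", 1.0), ("closed", 4.0), ("open", 5.5)],
    is_closed, is_open, 5.0,
)
late = satisfies_stable_bounded_response(
    [("closed", 0.0), ("closed", 3.0), ("open", 6.0)],
    is_closed, is_open, 5.0,
)
```

The early exit on `t_j - t_i > delta` relies on the Monotonicity constraint of Definition 2.1.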

The  specifications  considered  in  this  paper  (denoted  SPEC )  are  an  intersection  of  the  safety  specification  and  the  liveness 
specification,  defined  next. 

Definition 2.8 (safety specification) We define the safety specification by a set of computations based on (1) a set $SPEC_{bt}$
of instantaneous bad transitions of the form $(s_0, \nu) \rightarrow (s_1, \nu[\lambda := 0])$ where $s_0$ and $s_1$ are two locations and $\lambda$ is a set
of clock variables, and (2) a set $SPEC_{br}$ of $m$ bounded response properties of the form
$(P_1 \mapsto_{\leq\delta_1} Q_1) \wedge (P_2 \mapsto_{\leq\delta_2} Q_2) \wedge \ldots \wedge (P_m \mapsto_{\leq\delta_m} Q_m)$, for some $m$, $m \geq 1$. Precisely, the safety specification is the intersection of the following
sets:

1. the set of computations where no prefix contains a transition in $SPEC_{bt}$, and

2. the intersection of sets of computations corresponding to each stable bounded response property $P_i \mapsto_{\leq\delta_i} Q_i$ in $SPEC_{br}$,
where $1 \leq i \leq m$. ■

Notation. With abuse of notation for simplicity, throughout the paper, whenever we refer to $SPEC_{bt}$, we mean the corresponding
set of computations that do not contain a transition in $SPEC_{bt}$.

Definition 2.9 (liveness specification) A liveness specification is a set of computations that meets the following condition:
for each finite computation $\alpha$ there exists a computation $\beta$ such that $\alpha\beta$ is in that set. ■

Notation. We use $S^*$ to denote a finite computation $(\sigma_0, \tau_0) \rightarrow (\sigma_1, \tau_1) \rightarrow \cdots (\sigma_n, \tau_n)$ such that $\sigma_i \models S$ for all $i$, where
$0 \leq i \leq n$. Thus, $(true)^*$ denotes an arbitrary finite computation.

Now, we define what it means for program $p$ to refine a specification SPEC, and what it means for program $p'$ (typically,
a fault-tolerant program) to refine program $p$ (typically, a fault-intolerant program). Essentially, we would like to define ‘$p'$
refines $p$’ iff computations of $p'$ are a subset of those of $p$. However, if $p'$ is obtained by adding fault-tolerance to $p$ then $p'$ may
contain  additional  variables  that  are  not  in  p.  Hence,  it  will  be  necessary  to  project  the  computations  of  p'  on  (the  variables 
of)  p  and  then  check  if  the  projected  computation  is  a  computation  of  p.  More  precisely,  the  projection  of  a  state  of  p'  on 
p  (respectively,  SPEC)  is  a  state  obtained  by  considering  only  the  (discrete  and  clock)  variables  of  p  (respectively,  SPEC). 
Extending  this  definition  for  computations,  we  say  that  the  projection  of  a  computation  of  p'  on  p  (respectively,  SPEC)  is  a 
computation  obtained  by  projecting  each  state  in  that  computation  on  p  (respectively,  SPEC). 

Definition 2.10 (refines) We say that $p'$ refines $p$ (respectively, SPEC) from $S$ iff the following two conditions hold: (1) $S$
is closed in $p'$, and (2) for every computation of $p'$ that starts in a state where $S$ is true, the projection of that computation on $p$
(respectively, SPEC) is a computation of $p$ (respectively, SPEC). ■
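The following Python sketch illustrates projection and the two conditions of Definition 2.10 on small, explicitly listed sets of finite computations. This is only a finite stand-in for the (generally infinite) computation sets of the formal model, and it does not enforce suffix or fusion closure. The names `project`, `is_closed`, and `refines`, and the extra variable `aux`, are hypothetical.

```python
from typing import Callable, Dict, Iterable, List, Tuple

State = Dict[str, float]                  # variable name -> value (assumed encoding)
Computation = List[Tuple[State, float]]   # (state, global time) pairs


def project(comp: Computation, variables: Iterable[str]) -> tuple:
    """Project every state of comp onto `variables`; the result is hashable."""
    names = sorted(variables)
    return tuple((tuple((x, s[x]) for x in names), t) for s, t in comp)


def is_closed(S: Callable[[State], bool], comps: List[Computation]) -> bool:
    """S is closed if, once it holds in a computation, it keeps holding."""
    for comp in comps:
        holds = False
        for s, _ in comp:
            if holds and not S(s):
                return False
            holds = holds or S(s)
    return True


def refines(comps_new, comps_old, vars_old, S) -> bool:
    """Condition (1): S closed in p'. Condition (2): projections lie in p."""
    if not is_closed(S, comps_new):
        return False
    allowed = {project(c, vars_old) for c in comps_old}
    return all(
        project(c, vars_old) in allowed
        for c in comps_new
        if c and S(c[0][0])
    )


p_comps = [[({"x": 0}, 0.0), ({"x": 1}, 1.0)]]
good = [[({"x": 0, "aux": 7}, 0.0), ({"x": 1, "aux": 8}, 1.0)]]
bad = [[({"x": 0, "aux": 7}, 0.0), ({"x": 2, "aux": 8}, 1.0)]]
always = lambda s: True

good_refines = refines(good, p_comps, {"x"}, always)
bad_refines = refines(bad, p_comps, {"x"}, always)
```

The variable `aux` plays the role of a variable added while making the program fault-tolerant; it is dropped by the projection before the comparison.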

We also define the notion of maintains for finite computations. Specifically, given a finite prefix $\alpha$, $\alpha$ maintains SPEC
captures that the specification is not yet violated in $\alpha$.

Definition 2.11 (maintains) Let $\alpha$ be a prefix of a computation of $p$. We say that $\alpha$ maintains SPEC iff there exists a
computation $\beta$ such that the projection of $\alpha\beta$ on SPEC is in SPEC. ■

Informally speaking, proving the correctness of $p$ with respect to SPEC involves showing that $p$ refines SPEC from some
state predicate $S$, where $S \neq \{\}$. We call such a state predicate $S$ an invariant of $p$.

Definition 2.12 (invariant) Let $S$ be a state predicate. We say that $S$ is an invariant of $p$ for SPEC iff $p$ refines SPEC from
$S$ and $S \neq \{\}$. ■

2.3  Faults  and  Fault-Tolerance  in  Real-Time  Programs 

Intuitively,  the  faults  that  a  program  is  subject  to  are  systematically  represented  by  the  union  of  transitions  whose  execution 
perturbs  the  program  state  and  transitions  that  unexpectedly  advance  time.  While  state  corruption  faults  may  indirectly  cause 
waste  of  time,  delay  faults  directly  cause  waste  of  time  in  the  sense  that  they  defer  the  occurrence  of  some  desirable  event  by 
some  amount  of  time.  For  instance,  a  processor  crash  may  require  a  scheduler  to  assign  another  processor  to  a  set  of  jobs.  It  is 
natural to model the delay in start time of such jobs by delay faults that only advance the value of clock variables. Formally,
we  model  faults  as  a  set  of  transitions  over  (discrete  and  clock)  variables  of  p. 

Definition 2.13 (faults) For a program $p$ with state space $S_p$, the set $f$ of faults is specified by the union of the following two
sets:

1. the set $f_s$ of immediate faults of the form $(s_0, \nu) \rightarrow (s_1, \nu[\lambda := 0])$ where $s_0$ and $s_1$ are two locations and $\lambda$ is a
(possibly empty) set of clock variables, and

2. a set $f_d$ of delay faults of the form $(s, \nu) \rightarrow (s, \nu + \tau)$, which keeps the program in a location for some time $\tau \in \mathbb{R}_{\geq 0}$. ■

We now define what we mean by computations of a program in the presence of faults. Given a program $p$ and faults
$f$, we define the computations of $p$ in the presence of $f$ by finite fusion-closure of the computations of $p \cup f$ as follows.
Let $Z$ be a set of computations and $\bullet$ be the operator that fuses two (finite or infinite) computations of $Z$ such that
$\bullet(Z) = \{\alpha(\sigma, \tau)\beta \mid \exists \gamma, \psi : (\alpha(\sigma, \tau)\gamma \in Z) \wedge (\psi(\sigma, \tau)\beta \in Z)\}$. Also, let $FFC(Z)$ be the smallest fixpoint of $\bigcup_{i=0}^{\infty} \bullet^i(Z)$. Now,
we define the computations of $p$ in the presence of $f$ (denoted $p[]f$) as $FFC(p \cup f)$.
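A rough Python sketch of the fusion operator on an explicit finite set of finite computations follows. It assumes computations are tuples of (state, time) pairs, that a fault transition is represented as a two-element computation, and that the set is small enough for repeated fusion to saturate (for instance, no cycles among the (state, time) pairs). The names `fuse_once` and `ffc` are hypothetical.

```python
from typing import Set, Tuple

Point = Tuple[str, float]          # (state, global time), assumed encoding
Computation = Tuple[Point, ...]


def fuse_once(Z: Set[Computation]) -> Set[Computation]:
    """Add alpha (sigma, tau) beta whenever alpha (sigma, tau) gamma and
    psi (sigma, tau) beta are both in Z."""
    out = set(Z)
    for a in Z:
        for b in Z:
            for i, point in enumerate(a):
                for j, other in enumerate(b):
                    if point == other:
                        out.add(a[:i] + b[j:])
    return out


def ffc(Z: Set[Computation]) -> Set[Computation]:
    """Iterate fusion until nothing new is produced (assumes this terminates)."""
    current = set(Z)
    while True:
        nxt = fuse_once(current)
        if nxt == current:
            return current
        current = nxt


program_run = (("s0", 0.0), ("s1", 1.0), ("s2", 2.0))
fault_step = (("s1", 1.0), ("bad", 1.0))      # an immediate fault
recovery_run = (("bad", 1.0), ("s2", 3.0))    # program behavior after the fault

with_faults = ffc({program_run, fault_step, recovery_run})
perturbed = (("s0", 0.0), ("s1", 1.0), ("bad", 1.0), ("s2", 3.0))
```

The computation `perturbed` appears only after two rounds of fusion: first the program run is fused with the fault step, then the result is fused with the recovery run.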

Assumption 2.14 Observe that the above formulation of program computations in the presence of faults guarantees that the
number of occurrences of faults in a computation is finite. In this paper, however, since we deal with real-time programs and
our goal is to identify components that provide “bounded-time” recovery in the presence of faults, we assume that the number
of occurrences of faults in all computations is bounded by some number $k \in \mathbb{Z}_{\geq 0}$. This assumption is reasonable in many
commonly considered fault-tolerant real-time programs. In fact, it can be shown that providing bounded-time recovery in the
presence of an unbounded number of faults is not possible. ■

Just as we use invariants to show program correctness in the absence of faults, we use fault-spans to show the correctness
of programs in the presence of faults.

Definition 2.15 (fault-span) Let $S$ and $T$ be state predicates and $f$ be a set of fault transitions. We say that $T$ is an $f$-span of
$p$ from $S$ iff $S \subseteq T$, and $T$ is closed in $p[]f$. ■
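On a finite, untimed abstraction, the fault-span conditions reduce to a containment check and a closure check over program and fault transitions. The Python sketch below assumes the program and the faults are given as explicit sets of (source, target) pairs over a small set of states; the function names and the four-state example are hypothetical.

```python
from typing import Callable, Hashable, Iterable, Set, Tuple

Transition = Tuple[Hashable, Hashable]


def is_closed(T: Callable[[Hashable], bool], transitions: Iterable[Transition]) -> bool:
    """T is closed if no transition leads from a T-state to a non-T-state."""
    return all(T(dst) for src, dst in transitions if T(src))


def is_fault_span(
    S: Callable[[Hashable], bool],
    T: Callable[[Hashable], bool],
    states: Iterable[Hashable],
    program: Set[Transition],
    faults: Set[Transition],
) -> bool:
    """Check S subset of T, and T closed under program and fault transitions."""
    contains_S = all(T(s) for s in states if S(s))
    return contains_S and is_closed(T, set(program) | set(faults))


states = range(4)
program = {(0, 0), (1, 0), (2, 1), (3, 3)}
faults = {(0, 1), (1, 2)}
S = lambda s: s == 0

span_ok = is_fault_span(S, lambda s: s in {0, 1, 2}, states, program, faults)
span_too_small = is_fault_span(S, lambda s: s in {0, 1}, states, program, faults)
```

The second candidate fails because the fault transition from state 1 to state 2 leaves the candidate predicate.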

Notation. Henceforth, if the specification or projection of a program on SPEC is clear from the context, we omit it. For
example, “$S$ is an invariant of $p$” abbreviates “$S$ is an invariant of $p$ for SPEC”. Likewise, “a computation of $p$ is in SPEC”
abbreviates “the projection of that computation on SPEC is in SPEC”. Similarly, “$T$ is an $f$-span of $p$” abbreviates “$T$ is an
$f$-span of $p$ from $S$”.

We  now  define  what  we  mean  by  levels  of  fault-tolerance  in  the  context  of  real-time  programs.  Obviously,  in  the  absence  of 
faults,  a  program  should  refine  its  specification.  In  the  presence  of  faults,  however,  it  may  not  refine  its  specification  and,  hence, 
it may refine some (possibly) weaker ‘tolerance specification’. These specifications are based on satisfaction of a combination
of  safety,  liveness,  timing  constraints,  and  a  desirable  bounded-time  recovery  mechanism  in  the  presence  of  faults.  Intuitively, 
we  consider  three  levels  of  fault-tolerance,  namely  failsafe,  nonmasking,  and  masking,  based  on  satisfaction  of  safety  and 
liveness  properties  in  the  presence  of  faults.  For  failsafe  and  masking  fault-tolerance,  we,  furthermore,  consider  two  additional 
levels,  namely  soft  and  hard,  based  on  satisfaction  of  timing  constraints  in  the  presence  of  faults.  Moreover,  nonmasking  and 
masking  fault-tolerant  programs  are  associated  with  a  required  recovery  time,  which  can  be  bounded  or  unbounded. 

To  motivate  the  idea  of  soft  and  hard  fault-tolerance,  let  us  consider  the  railroad  crossing  problem.  Suppose  that  a  train  is 
approaching  a  railroad  crossing.  The  safety  specification  requires  that  “if  the  train  is  crossing,  the  gate  must  be  closed”.  Also, 
a  stable  bounded  response  property  requires  that  “once  the  gate  is  closed,  it  should  reopen  within  5  minutes”.  In  this  example, 
it  may  be  catastrophic  if  the  train  is  crossing  while  the  gate  is  open  due  to  occurrence  of  faults.  On  the  other  hand,  if  the 
gate  remains  closed  for  more  than  5  minutes  due  to  occurrence  of  faults,  the  outcome  is  not  disastrous.  Thus,  depending  upon 
the  outcome  of  violation  of  a  safety  specification,  the  desired  level  of  fault-tolerance  changes.  Hence,  in  the  railroad  crossing 
problem  the  system  must  tolerate  faults  that  cause  the  gate  to  remain  open  while  the  train  is  crossing.  We  call  such  a  system  soft 
fault-tolerant.  Intuitively,  a  soft  fault-tolerant  real-time  program  is  not  required  to  satisfy  its  timing  constraints  in  the  presence 
of  faults. 

Now,  consider  a  system  that  controls  internal  pressure  of  a  boiler.  Suppose  that  in  this  system,  the  safety  specification 
requires  that  once  a  pressure  gauge  reads  30  pounds  per  square  inch,  the  controller  must  issue  a  command  to  open  a  valve 
within  20  seconds.  In  such  a  system,  if  occurrence  of  faults  causes  the  controller  not  to  respond  within  the  required  time,  the 
outcome  may  be  disastrous.  Thus,  our  boiler  controller  must  satisfy  its  timing  constraints  even  in  the  presence  of  faults.  In 
other  words,  the  boiler  controller  must  be  hard  fault-tolerant.  Intuitively,  a  hard  fault-tolerant  real-time  program  must  satisfy 
its  timing  constraints  even  in  the  presence  of  faults.  In  fact,  in  hard  fault-tolerant  programs,  the  demand  for  hard  real-time 
processing  merges  with  catastrophic  consequences  of  systems,  whereas  in  soft  fault-tolerance  the  catastrophic  consequences 
are  not  related  to  the  program’s  timing  constraints. 

Below,  we  define  tolerance  specifications  that  often  occur  in  practice.  Let  SPEC  be  the  specification  obtained  by  the 
intersection  of  SPECbt ,  SPECbr ,  and  liveness  specification  as  defined  in  Subsection  2.2. 

Definition 2.16 (soft and hard-failsafe tolerance specification) The soft-failsafe tolerance specification of SPEC is the
smallest safety specification containing $SPEC_{bt}$ (denoted $SSPEC_{bt}$). The hard-failsafe tolerance specification of SPEC is
the intersection of $SSPEC_{bt}$ and the smallest specification containing $SPEC_{br}$ (denoted $SSPEC_{br}$). I.e., the hard-failsafe
tolerance specification of SPEC is $SSPEC_{bt} \cap SSPEC_{br}$. ■


Definition 2.17 (nonmasking tolerance specification) Since nonmasking fault-tolerance does not require the safety specification to be satisfied in the presence of faults, the nonmasking tolerance specification of SPEC with recovery time $\theta$ is $(true)^{*}_{\le\theta}\,SPEC$. ■

Definition 2.18 (soft and hard-masking tolerance specification) The soft-masking tolerance specification of SPEC with recovery time $\theta$ is $(SPEC_{bt})_{\le\theta}\,SPEC$. The hard-masking tolerance specification of SPEC with recovery time $\theta$ is SPEC. ■

Remark. Notice that the hard-masking tolerance specification is independent of the recovery time $\theta$. This is because, unlike
nonmasking  and  soft-masking,  in  hard-masking  fault-tolerance  the  entire  specification  (i.e.,  SPEC)  must  always  be  satisfied 
and,  hence,  recovery  time  becomes  only  a  matter  of  the  amount  of  time  that  the  program  is  in  states  outside  its  invariant. 

Using these definitions, we are now ready to define what it means for a program to tolerate a fault-class f. With the intuition that a program is f-tolerant to SPEC if it refines SPEC in the absence of faults and it refines a tolerance specification of SPEC in the presence of f, we define ‘f-tolerant to SPEC from S’ as follows.

Definition 2.19 (fault-tolerant programs) We say that p is soft/hard-failsafe f-tolerant to SPEC from S (respectively, nonmasking or soft/hard-masking f-tolerant to SPEC with recovery time $\theta$ from S) iff the following two conditions hold:

• p refines SPEC from S, and

• there exists T such that $T \supseteq S$ and $p\,[]\,f$ refines the soft/hard-failsafe tolerance specification of SPEC from T (respectively, the nonmasking or soft/hard-masking tolerance specification of SPEC with recovery time $\theta$ from T). ■

Note  that  for  nonmasking  and  masking  levels  of  fault-tolerance  one  can  choose  to  have  unbounded-time  recovery.  We  address 
the  effect  of  the  choice  of  recovery  time  in  Section  4. 

Notation.  In  the  sequel,  whenever  the  specification  SPEC  and  the  invariant  S  are  clear  from  the  context,  we  omit  them;  thus, 
“nonmasking f-tolerant with recovery time $\theta$” abbreviates “nonmasking f-tolerant to SPEC with recovery time $\theta$ from S”,
and  so  on. 


3  Real-Time  Fault-Tolerance  Components 

In this section, we present real-time fault-tolerance components, namely, detectors, weak $\delta$-correctors, and strong $\delta$-correctors.
Once  we  define  these  components,  in  Section  4,  we  present  the  relevance  of  each  component  to  the  levels  of  fault-tolerance 
introduced  in  Subsection  2.3. 

3.1  Detectors 

In  this  subsection,  we  formally  introduce  the  first  of  the  three  tolerance  components,  detectors.  We  will  develop  their  theory 
and present a simple altitude switch example to illustrate an instance of detectors in Subsection 4.2.

Definition 3.1 (detects) Let X and Z be state predicates. Let ‘Z detects X’ be the specification, that is, the set of all computations $\overline{\sigma} = (\sigma_0, \tau_0) \to (\sigma_1, \tau_1) \to \cdots$, satisfying the following three conditions:

• (Safeness) For each $i \ge 0$, if $\sigma_i \models Z$ then $\sigma_i \models X$. (In other words, $\sigma_i \models (Z \Rightarrow X)$.)

• (Progress) For each $i \ge 0$, if $\sigma_i \models X$ then there exists $k \ge i$ such that $\sigma_k \models Z$ or $\sigma_k \not\models X$.

• (Stability) There exists $i \ge 0$ such that for all $j \ge i$, if $\sigma_j \models Z$ then $\sigma_{j+1} \models Z$ or $\sigma_{j+1} \not\models X$. ■

Definition 3.2 (detector) Z detects X in d from U iff d refines ‘Z detects X’ from U. ■

A  detector  d  is  used  to  check  whether  its  “detection  predicate”,  X,  is  true.  Since  d  satisfies  Progress  from  U,  in  any 
computation in d, if $U \wedge X$ is true continuously, d eventually detects this fact and makes Z true. Since d satisfies Safeness
from  U,  it  follows  that  d  never  lets  Z  witness  X  incorrectly.  Moreover,  since  d  satisfies  Stability  from  U,  it  follows  that  once 
Z  becomes  true,  it  continues  to  be  true  unless  X  is  falsified. 
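The three detector conditions can be made concrete with a small Python sketch that evaluates them on a finite, untimed trace. This is only an approximation under stated assumptions: Progress is judged inside the observed window, and Stability is checked in its stronger form with $i = 0$, since on a finite trace the existential form holds trivially. The function name `check_detects` and the state encoding are hypothetical.

```python
from typing import Any, Callable, Dict, Sequence

Pred = Callable[[Any], bool]


def check_detects(trace: Sequence[Any], X: Pred, Z: Pred) -> Dict[str, bool]:
    """Evaluate finite-trace versions of Safeness, Progress, and Stability."""
    n = len(trace)
    x = [X(s) for s in trace]
    z = [Z(s) for s in trace]

    # Safeness: Z never holds in a state where X does not hold.
    safeness = all(x[i] for i in range(n) if z[i])

    # Progress (within the window): whenever X holds at position i, some
    # position k >= i either witnesses it (Z holds) or has X falsified.
    progress = all(any(z[k] or not x[k] for k in range(i, n))
                   for i in range(n) if x[i])

    # Stability (stronger form, from position 0): once Z holds it keeps
    # holding in the next state unless X is falsified there.
    stability = all(z[j + 1] or not x[j + 1]
                    for j in range(n - 1) if z[j])

    return {"safeness": safeness, "progress": progress, "stability": stability}
```

For example, with states encoded as pairs `(x, z)`, the trace `[(False, False), (True, False), (True, True), (True, True), (False, False)]` passes all three checks, whereas a trace in which Z drops while X still holds fails Stability.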

Definition 3.3 (tolerant detector) d is a soft/hard-failsafe (respectively, nonmasking or soft/hard-masking) tolerant detector (respectively, with recovery time $\theta$) to ‘Z detects X’ from U iff d refines the soft/hard-failsafe (respectively, nonmasking or soft/hard-masking) tolerance specification of ‘Z detects X’ (respectively, with recovery time $\theta$) from U. ■

3.2 $\delta$-Correctors

In this subsection, we formally introduce the remaining tolerance components, weak and strong $\delta$-correctors. We will develop their theory and present examples to illustrate instances of $\delta$-correctors in Subsections 4.1, 4.2, and 4.3.

Definition 3.4 (weakly corrects) Let X and Z be state predicates. Let ‘Z weakly corrects X within $\delta$’ be the specification, that is, the set of all computations $\overline{\sigma} = (\sigma_0, \tau_0) \to (\sigma_1, \tau_1) \to \cdots$, satisfying the following conditions:

• (Weak Convergence) There exists $i \ge 0$ such that $\sigma_i \models X$ and $(\tau_i - \tau_0) \le \delta$.

• (Safeness) For each $i \ge 0$, if $\sigma_i \models Z$ then $\sigma_i \models X$. (In other words, $\sigma_i \models (Z \Rightarrow X)$.)

• (Progress) For each $i \ge 0$, if $\sigma_i \models X$ then there exists $k \ge i$ such that $\sigma_k \models Z$ or $\sigma_k \not\models X$.

• (Stability) There exists $i \ge 0$ such that for all $j \ge i$, if $\sigma_j \models Z$ then $\sigma_{j+1} \models Z$ or $\sigma_{j+1} \not\models X$. ■

Definition 3.5 (weak $\delta$-corrector) Z weakly corrects X within $\delta$ in c from U iff c refines ‘Z weakly corrects X within $\delta$’ from U. ■

Definition 3.6 (strongly corrects) Let X and Z be state predicates. Let ‘Z strongly corrects X within $\delta$’ be the specification, that is, the set of all computations $\overline{\sigma} = (\sigma_0, \tau_0) \to (\sigma_1, \tau_1) \to \cdots$, satisfying the following two conditions:

• Z weakly corrects X within $\delta$, and

• (Strong Convergence) In addition to Weak Convergence, X is closed in $\overline{\sigma}$. ■

Definition 3.7 (strong $\delta$-corrector) We say that Z strongly corrects X within $\delta$ in c from U iff c refines ‘Z strongly corrects X within $\delta$’ from U. ■

Since c satisfies Weak (respectively, Strong) Convergence from U, it follows that c reaches a state where X becomes true within $\delta$ time units (and, respectively, X continues to be true thereafter). Moreover, since c satisfies Safeness from U, it follows that a corrector never lets the predicate Z witness the correction predicate X incorrectly. Since c satisfies Progress from U, it follows that eventually Z becomes true. And, finally, since c satisfies Stability from U, it follows that when Z becomes true, Z is never falsified.
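The difference between Weak and Strong Convergence can be sketched in Python on a finite timed trace. The representation of a computation as a list of `(state, time)` pairs and the function names below are assumptions made for this illustration; closure of X is checked only over the observed states.

```python
from typing import Any, Callable, List, Tuple

TimedState = Tuple[Any, float]  # (state, global time), a hypothetical encoding
Pred = Callable[[Any], bool]


def weak_convergence(trace: List[TimedState], X: Pred, delta: float) -> bool:
    """Some state satisfying X is reached at most delta after the start."""
    if not trace:
        return False
    t0 = trace[0][1]
    return any(X(s) and (t - t0) <= delta for s, t in trace)


def is_closed(trace: List[TimedState], X: Pred) -> bool:
    """X is closed in the trace: if X holds in a state, it holds in the next."""
    return all(X(nxt) for (cur, _), (nxt, _) in zip(trace, trace[1:]) if X(cur))


def strong_convergence(trace: List[TimedState], X: Pred, delta: float) -> bool:
    """Weak Convergence plus closure of X in the trace."""
    return weak_convergence(trace, X, delta) and is_closed(trace, X)
```

A trace that reaches X in time but later leaves it satisfies `weak_convergence` and fails `strong_convergence`, which is exactly the distinction between the two corrector types.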

Definition 3.8 (tolerant $\delta$-corrector) c is a soft/hard-failsafe (respectively, nonmasking or soft/hard-masking) tolerant weak/strong $\delta$-corrector (respectively, with recovery time $\theta$) to ‘Z weakly/strongly corrects X within $\delta$’ from U iff c refines the soft/hard-failsafe (respectively, nonmasking or soft/hard-masking) tolerance specification (respectively, with recovery time $\theta$) of ‘Z weakly/strongly corrects X within $\delta$’ from U. ■

Remark. If the witness predicate Z is identical to the correction predicate X, our definition of the corrects relation reduces to the one given by Arora and Gouda [6]. We have considered this more general definition to accommodate the case, which occurs for instance in soft/hard-masking tolerance, where the witness predicate Z can be checked atomically but the correction predicate X cannot.

4 The Necessity of Detectors and $\delta$-Correctors in Fault-Tolerant Real-Time Programs

In this section, we develop the theory of detectors and $\delta$-correctors. More specifically, we show the necessity of existence of a combination of these components in different levels of fault-tolerance introduced in Subsection 2.3.

Observe that, in the definition of tolerance specifications in Subsection 2.3, when the required recovery time for nonmasking and soft-masking fault-tolerance is unbounded, the resulting levels of fault-tolerance reduce to the ones addressed by Arora and Kulkarni in [1]. Moreover, since soft-failsafe programs are not required to satisfy their timing constraints in the presence of faults, this level reduces to the one considered in [1] as well. Thus, in the following subsections, out of eight possible levels of fault-tolerance introduced in Subsection 2.3, we only consider the ones that are not addressed in [1].

4.1 The Role of Strong $\delta$-Correctors in Nonmasking Fault-Tolerance

In this subsection, we start developing the theory of strong $\delta$-correctors. Subsequently, we build upon an example to illustrate an instance of $\delta$-correctors. Intuitively, our goal is to show that if a program is nonmasking f-tolerant then it contains nonmasking tolerant strong $\delta$-correctors. We present this result in Theorem 4.4. Throughout this section, let p be a program, SPEC be a specification, and $\sigma$ be a state.

In order to prove our goal, first, in Claim 4.2, we show that if a program refines a specification within $\theta$ time units then it contains strong $\delta$-correctors for some $\delta$. More specifically, let p be a program that refines SPEC from S. In Claim 4.2, we show that if p' is designed such that it behaves like p within $\theta$ and, thus, has a suffix in SPEC, then p' is a strong $\delta$-corrector of an invariant predicate of p for some $\delta$. We prove this claim by showing that p' itself refines the required strong $\delta$-corrector specification.

Definition 4.1 (becomes) We say that p' becomes p within $\theta$ from T iff p' refines $(true)^{*}_{\le\theta}\,p$ from T. ■

Claim 4.2 Programs that refine a specification within $\theta$ time units contain strong $\delta$-correctors. That is,

if

• p refines SPEC from S,

• p' refines p from S, and

• p' becomes $p|S$ within $\theta$ from T

then

• there exists $\delta \in \mathbb{Z}_{\ge 0}$ such that p' is a strong $\delta$-corrector of an invariant predicate of p.

Proof.

Let X = S, and

$Z = S \wedge \{\sigma \mid \sigma \text{ is a state of } p' \text{ such that } \sigma \text{ is reached in some computation of } p' \text{ starting from } T\}$.

Since p refines SPEC from S, it follows that X is an invariant predicate of p for SPEC. Now, we show that there exists $\delta \in \mathbb{Z}_{\ge 0}$ such that p' refines ‘Z strongly corrects X within $\delta$’ from T:

• By definition of Z, in any state where Z is true, S is true. Thus, Safeness is satisfied.

• Since p' becomes $p|S$ within $\theta$ from T, i.e., p' refines $(true)^{*}_{\le\theta}(p|S)$ from T, every computation of p' starting from T will reach a state where S is true within $\theta$. By definition of Z, Z is true in this state as well. Thus, Progress is satisfied.

• Since p' refines p from S, it follows that S is closed in p'. Obviously, the second conjunct in Z is closed in p' as well and, thus, Z is closed in p'. Moreover, since we are guaranteed that Z is eventually reached in any computation, Stability is satisfied.

• Since p' refines $(true)^{*}_{\le\theta}(p|S)$ from T, every computation of p' starting from T will reach a state where S is true within $\theta$. Moreover, S (= X) is closed in p'. Thus, by choosing $\delta = \theta$, Strong Convergence within $\delta$ is satisfied. ■

The next lemma generalizes Claim 4.2. In general, given a program p that refines SPEC from S, p' may not behave like p from each state in S but only from a subset of S, say R. This may happen, for example, if p' contains additional variables and p' behaves like p only after the values of these additional variables are restored. Lemma 4.3 shows that in such a case, p' contains a nonmasking tolerant strong $\delta$-corrector of an invariant predicate of p. (The strong $\delta$-corrector is nonmasking in the sense that the correction predicate is preserved only after p' reaches a state where R is true.)

Lemma 4.3

if

• p refines SPEC from S,

• p' refines p from R, where $R \Rightarrow S$, and

• p' becomes $p|R$ within $\theta$ from T

then

• there exists $\delta \in \mathbb{Z}_{\ge 0}$ such that p' is a nonmasking tolerant strong $\delta$-corrector with recovery time $\theta$ of an invariant predicate of p.

Proof.

Let X = S, and

Z = R.

We show that there exists $\delta \in \mathbb{Z}_{\ge 0}$ such that p' refines the nonmasking tolerance specification with recovery time $\theta$ of ‘Z strongly corrects X within $\delta$’ from T. In particular, we first show that a computation of p' starting from a state where T is true reaches a state where R is true within $\theta$ time units. Then, we show that starting from this state p' refines ‘Z strongly corrects X within $\delta$’.

• For the first part, since p' becomes $p|R$ within $\theta$ from T, it follows that p' reaches a state where R is true within $\theta$.

• For the second part, we show that there exists $\delta \in \mathbb{Z}_{\ge 0}$ such that starting from this state, p' satisfies Safeness, Progress, Stability, and Strong Convergence within $\delta$. Since $R \Rightarrow S$ is trivially true, Safeness is satisfied. In a state where R is true, Progress is satisfied. Since p' refines p from R and R is closed in p', Stability is satisfied. Finally, since in a computation starting from a state where R is true, S is immediately true at all states and S is closed in p', by choosing $\delta = 0$, Strong Convergence is satisfied. ■

We now use Claim 4.2 and Lemma 4.3 to show that if a nonmasking f-tolerant program p' with recovery time $\theta$ is designed by reusing p then there exists $\delta \in \mathbb{Z}_{\ge 0}$ such that p' contains a nonmasking f-tolerant strong $\delta$-corrector with recovery time $\theta$ for an invariant predicate of p.


Theorem 4.4 Nonmasking f-tolerant programs with recovery time $\theta$ contain nonmasking tolerant strong $\delta$-correctors with recovery time $\theta'$, for some $\theta'$. That is,

if

• p refines SPEC from S,

• p' is nonmasking f-tolerant for SPEC from R, where $R \Rightarrow S$,

• p' refines p from R, and

• p' becomes $p|R$ within $\theta$ from T, where T is an f-span of p'

then

• there exist $\delta$ and $\theta'$ in $\mathbb{Z}_{\ge 0}$ such that p' is a nonmasking f-tolerant strong $\delta$-corrector with recovery time $\theta'$ of an invariant predicate of p.


Proof. Let $T_1$ be the fault-span used to show that p' is nonmasking fault-tolerant. Let $T_2 = T_1 \wedge T$. Since $T_1$ and T are f-spans, both are closed in $p'\,[]\,f$. Hence $T_2$ is closed in $p'\,[]\,f$. Moreover, since $T_2 \subseteq T$, p' becomes $p|R$ within $\theta$ from $T_2$.

Now, we use the definition of Z and X given in the proof of Lemma 4.3 and show that there exist $\delta, \theta' \in \mathbb{Z}_{\ge 0}$ such that p' is nonmasking f-tolerant with recovery time $\theta'$ to ‘Z strongly corrects X within $\delta$’ from R and the f-span of p' is $T_2$. To this end, we first show that there exists $\delta \in \mathbb{Z}_{\ge 0}$ such that p' refines ‘Z strongly corrects X within $\delta$’ from R, and then show that $p'\,[]\,f$ refines the nonmasking tolerance specification with recovery time $\theta$ of ‘Z strongly corrects X within $\delta$’ from $T_2$.

In Lemma 4.3, we have shown that there exists $\delta \in \mathbb{Z}_{\ge 0}$ such that starting from any state in R, every computation of p' satisfies Safeness, Progress, Stability, and Strong Convergence within $\delta$. It follows that by choosing $\delta = 0$, p' refines ‘Z strongly corrects X within $\delta$’ from R.

The proof of Lemma 4.3 can also be used to show that there exists $\delta \in \mathbb{Z}_{\ge 0}$ such that p' refines the nonmasking tolerance specification with recovery time $\theta$ of ‘Z strongly corrects X within $\delta$’ from $T_2$. In the presence of f, this specification may be violated. However, by Assumption 2.14, after at most k faults occur, where k is the bound on the number of faults that may occur in a computation, p' reaches a state where R is true within at most $k \cdot \theta$ time units (i.e., $\theta' = k \cdot \theta$). And, from this state, p' refines ‘Z strongly corrects X within $\delta (= 0)$’. Thus, p' is nonmasking f-tolerant with recovery time $\theta' = k \cdot \theta$ to ‘Z strongly corrects X within $\delta (= 0)$’ from R. ■

In the proof of Theorem 4.4, although we showed that the recovery time that a strong $\delta$-corrector provides in the worst case
depends  on  the  maximum  number  of  occurrence  of  faults,  one  can  design  correctors  with  better  recovery  time.  For  instance, 
in  [7],  the  authors  present  automated  methods  to  synthesize  one  type  of  such  correctors  using  state  exploration  and  Dijkstra’s 
shortest  path  algorithm. 
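To give a feel for how a shortest-path computation relates to recovery time, the following Python sketch computes, for every state of a small explicit state graph, the least total delay needed to reach the invariant. It is a generic multi-source Dijkstra search on the reversed graph and is not the synthesis method of [7]; the graph encoding (a dictionary from a state to its `(successor, delay)` pairs) and the function name `shortest_recovery_times` are assumptions made for illustration.

```python
import heapq
from typing import Dict, Hashable, Iterable, List, Tuple

Edges = Dict[Hashable, List[Tuple[Hashable, float]]]


def shortest_recovery_times(edges: Edges,
                            invariant: Iterable[Hashable]) -> Dict[Hashable, float]:
    """Least total delay from each state to some invariant state.

    States that cannot reach the invariant are absent from the result.
    Delays are assumed to be non-negative.
    """
    # Reverse the graph so that one search from the invariant covers all states.
    reverse: Edges = {}
    for u, outs in edges.items():
        for v, w in outs:
            reverse.setdefault(v, []).append((u, w))

    dist: Dict[Hashable, float] = {s: 0 for s in invariant}
    heap = [(0, s) for s in dist]
    heapq.heapify(heap)
    while heap:
        d, v = heapq.heappop(heap)
        if d > dist.get(v, float("inf")):
            continue  # stale queue entry
        for u, w in reverse.get(v, []):
            nd = d + w
            if nd < dist.get(u, float("inf")):
                dist[u] = nd
                heapq.heappush(heap, (nd, u))
    return dist
```

The maximum of the returned values over the states of the fault-span is the smallest recovery time that any corrector restricted to these transitions could guarantee in this toy model.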

Remark.  As  mentioned  earlier  in  this  section,  if  we  do  not  require  bounded-time  recovery  for  nonmasking  programs,  the 
results  presented  in  this  subsection  reduce  to  the  one  given  by  Arora  and  Kulkarni  [1],  in  which  nonmasking  tolerant  correctors 
are defined as strong $\infty$-correctors.

4.1.1  Example 

Consider  an  alternating  bit  protocol  where  an  infinite  input  array  at  a  sender  process  is  to  be  copied,  one  array  item  at  a  time, 
into  an  infinite  output  array  at  a  receiver  process.  The  sender  and  receiver  communicate  via  a  bidirectional  channel  that  can 
hold  at  most  one  message  in  each  direction  at  a  time. 

Fault-intolerant program. Iteratively, a simple loop is followed: sender s sends a copy of one array item to receiver r. Upon receiving this item, r sends an acknowledgment to s, which enables the next array item to be sent by s, and so on. Each iteration is supposed to take at most $\tau$ time units. The program maintains binary variables rs in s and rr in r; rs is 1 if s has received an acknowledgement for the last message it sent, and rr is 1 if r has received a message but has not sent an acknowledgement yet. Furthermore, the program maintains a timer x which is reset every time s sends a message. Also, s receives an acknowledgement if the timer has not expired. The 0 or 1 messages in transit from s to r are denoted by the sequence cs, and the 0 or 1 acknowledgments in transit from r to s are denoted by the sequence cr. Finally, the index in the input array corresponding to the item that s will send next is denoted by ns, and the index in the output array corresponding to the item that r last received is denoted by nr.

The intolerant program contains five actions, the first two in s and the next two in r. The last action is a guarded wait command, which models legitimate delays. The specification SPEC requires that each item is delivered to the receiver and the sender receives the corresponding acknowledgement within 3 time units.


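ID1 :: $rs = 1 \xrightarrow{\{x\}} rs, cs := 0, cs \circ \langle ns \rangle$

ID2 :: $cr \neq \langle\rangle \wedge x \le \tau \rightarrow rs, cr, ns := 1, tail(cr), ns + 1$

ID3 :: $cs \neq \langle\rangle \rightarrow cs, rr, nr := tail(cs), 1, head(cs)$

ID4 :: $rr = 1 \rightarrow rr, cr := 0, cr \circ \langle nr \rangle$

ID5 :: $x \le 3 \rightarrow \textit{wait}$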

Invariant. When r receives a message, nr = ns holds, and this equation continues to hold until s receives an acknowledgement. When s receives an acknowledgement, ns is exactly one greater than nr, and this equation continues to hold until r receives the next message. Also, if cs is nonempty, cs contains only one item, $\langle ns \rangle$. Finally, at any state exactly two actions are enabled: one of ID1, ..., ID4, and ID5. Thus, the invariant of the fault-intolerant program is:

$S_{id} = (x \le \tau) \wedge ((rr = 1 \vee cr \neq \langle\rangle) \Rightarrow nr = ns) \wedge ((rs = 1 \vee cs \neq \langle\rangle) \Rightarrow nr = ns - 1) \wedge (cs = \langle\rangle \vee cs = \langle ns \rangle) \wedge (|cs| + |cr| + rs + rr = 1)$

Fault Actions. In this example, the immediate fault actions cause loss of either a message or an acknowledgment. The delay faults expire the transmission timer and may advance it by at most 2 time units. The corresponding fault actions are as follows:

$cs \neq \langle\rangle \rightarrow cs := tail(cs)$

$cr \neq \langle\rangle \rightarrow cr := tail(cr)$

$3 < x \le 5 \rightarrow \textit{wait}$

Nonmasking fault-tolerant program. In order to add nonmasking fault-tolerance with recovery time $\theta = 10$, we design and add a strong $\delta$-corrector CR with correction predicate $X = S_{id}$ (i.e., the invariant of the intolerant program), witness predicate $Z = S_{id} \wedge \{\sigma \mid x(\sigma) = 0\}$ (i.e., states of the invariant where x is reset), and $\delta = 13$. Precisely, CR consists of two actions: (1) an action that retransmits the last sent item, and (2) a delay action that maintains the required recovery time $\theta$. Thus, the actions of the corrector CR are as follows:

CR1 :: $(cs = \langle\rangle \wedge cr = \langle\rangle \wedge rs = 0 \wedge rr = 0) \vee (x > \tau) \rightarrow cs := cs \circ \langle ns \rangle$

CR2 :: $3 < x \le 13 \rightarrow \textit{wait}$


Thus,  the  nonmasking  program  consists  of  seven  actions;  five  actions  are  identical  to  the  actions  of  program  ID,  the  rest  are 
the  actions  of  the  corrector  CR  identified  above. 
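The untimed behaviour of this example can be sketched in Python as follows. The sketch is a simplification under stated assumptions: the timer x, the wait actions ID5 and CR2, and the timeout disjunct of CR1 are omitted, actions are chosen by a fixed priority rather than nondeterministically, and only the message-loss fault is modelled. The class name `ABP` and its method names are hypothetical.

```python
class ABP:
    """Untimed sketch of actions ID1..ID4 plus the idle-channel part of CR1."""

    def __init__(self) -> None:
        self.rs, self.rr = 1, 0      # sender ready; receiver has nothing pending
        self.cs, self.cr = [], []    # channels s->r and r->s (at most one item each)
        self.ns, self.nr = 0, -1     # next index to send; last index received

    def step(self) -> str:
        """Execute one enabled action and return its name."""
        if self.rs == 1:                       # ID1: send item ns
            self.rs = 0
            self.cs.append(self.ns)
            return "ID1"
        if self.cr:                            # ID2: receive acknowledgement
            self.cr.pop(0)
            self.rs = 1
            self.ns += 1
            return "ID2"
        if self.cs:                            # ID3: receive message
            self.nr = self.cs.pop(0)
            self.rr = 1
            return "ID3"
        if self.rr == 1:                       # ID4: send acknowledgement
            self.rr = 0
            self.cr.append(self.nr)
            return "ID4"
        # CR1 (first disjunct): everything is idle, so retransmit item ns.
        self.cs.append(self.ns)
        return "CR1"

    def lose_message(self) -> None:
        """Fault action: drop the message in transit from s to r, if any."""
        if self.cs:
            self.cs.pop(0)

    def untimed_invariant(self) -> bool:
        """All conjuncts of the invariant except the timer bound."""
        one_token = len(self.cs) + len(self.cr) + self.rs + self.rr == 1
        recv_ok = not (self.rr == 1 or bool(self.cr)) or self.nr == self.ns
        send_ok = not (self.rs == 1 or bool(self.cs)) or self.nr == self.ns - 1
        chan_ok = self.cs in ([], [self.ns])
        return one_token and recv_ok and send_ok and chan_ok
```

Running `step` repeatedly without faults cycles through ID1, ID3, ID4, ID2; after a call to `lose_message` the untimed invariant is violated, and the next `step` executes CR1 and restores it.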

4.2 The Role of Detectors and Weak $\delta$-Correctors in Hard-Failsafe Fault-Tolerance

In this subsection, we present a more rigorous theory of detectors than the one presented in [1] and develop the theory of weak $\delta$-correctors. Intuitively, in this subsection, our goal is to show that if a program is hard-failsafe f-tolerant then it contains hard-failsafe tolerant detectors and weak $\delta$-correctors. We present this result in Theorem 4.13.

Our proof is organized as follows: first, Lemma 4.6 shows that the violation of the untimed part of the safety specification can be detected from the current state, independent of how that state is reached. Subsequently, Lemma 4.7 shows that there exists a set of states from where execution of the program maintains the given safety specification. Next, using Lemmas 4.6 and 4.7, in Lemma 4.9, we show that for any program p, there exists a unique weakest detection predicate. Then, in Claim 4.12, we prove that if a program refines a safety specification then it contains detectors and weak $\delta$-correctors. Finally, in Theorem 4.13, we show that hard-failsafe programs contain hard-failsafe tolerant detectors and weak $\delta$-correctors.

Throughout this section, let p be a program, SPEC be a specification, $SSPEC_{bt}$ and $SSPEC_{br}$ be the minimal safety specifications that contain $SPEC_{bt}$ and $SPEC_{br}$, respectively, $\alpha$ be a prefix of a computation, $\beta$ be a finite suffix of a computation, $\sigma$ and $\sigma'$ be states, and X be a state predicate.

Assumption 4.5 Without loss of generality, for simplicity, we assume that transitions that correspond to different actions of the program (except guarded wait commands) are mutually disjoint, i.e., they do not contain overlapping transitions. The results in this paper are valid without this assumption in the sense that we can easily modify the given program to one that satisfies this assumption. Moreover, we assume that program computations are fair in the sense that in every computation, if the guard of an action is continuously true then that action is eventually chosen for execution.

Recall  that  in  order  to  show  the  existence  of  the  correctors  in  Subsection  4.1,  we  proved  that  the  fault-tolerant  program 
contains  a  corrector  for  an  invariant  predicate  of  the  fault-intolerant  program.  For  existence  of  detectors,  we  would  like  to 
show  that  the  fault-tolerant  program  contains  a  detector  for  a  detection  predicate  associated  with  the  fault-intolerant  program. 
However,  unlike  the  invariant  predicate,  we  need  to  identify  additional  syntactic  characteristics  of  program  before  detection 
predicates  can  be  identified.  With  this  intuition,  we  now  characterize  programs  syntactically.  First,  we  reiterate  two  lemmas 
from [1]. Note that in Lemma 4.6, since the transitions in $SPEC_{bt}$ occur instantaneously, we ignore the global time of states.

Lemma 4.6 (Arora and Kulkarni [1])

If

• $\alpha\sigma$ maintains $SPEC_{bt}$

then

• $\alpha\sigma\sigma'$ maintains $SPEC_{bt}$ iff $\sigma\sigma'$ maintains $SPEC_{bt}$. ■

Lemma 4.7 (Arora and Kulkarni [1]) Given program p, there exists a predicate such that execution of p in a state where that predicate is true maintains $SPEC_{bt}$. ■

Definition 4.8 (detection predicate) We say that X is a detection predicate of p for $SPEC_{bt}$ iff execution of p in any state where X is true maintains $SPEC_{bt}$. ■




We  now  prove  the  uniqueness  of  the  weakest  detection  predicate  for  a  given  program  p. 


Lemma 4.9 Given a program p, there exists a unique weakest detection predicate of p for $SPEC_{bt}$.

Proof. Note that the existence of detection predicates follows from Lemmas 4.6 and 4.7, and that a program may have multiple detection predicates. Now, let sf be a detection predicate of p for $SPEC_{bt}$ and X be an arbitrary state predicate. It is easy to see that if $X \Rightarrow sf$, then X is also a detection predicate of p for $SPEC_{bt}$. And, if $sf_1$ and $sf_2$ are detection predicates of p for $SPEC_{bt}$, then so is $sf_1 \vee sf_2$. Thus, there exists a unique weakest detection predicate for the given program. ■
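The argument of Lemma 4.9 can be illustrated on a finite state space with the following Python sketch. It assumes, for illustration only, that the untimed safety specification is given as a set of forbidden transitions and that an action is a function from a state to its possible successor states; under these assumptions a detection predicate is a set of states from which the action never takes a forbidden transition. The function names are hypothetical.

```python
from typing import Callable, Hashable, Iterable, Set, Tuple

Action = Callable[[Hashable], Iterable[Hashable]]
BadTransitions = Set[Tuple[Hashable, Hashable]]


def is_detection_predicate(pred: Set[Hashable],
                           action: Action,
                           bad: BadTransitions) -> bool:
    """True iff executing the action in any state of pred avoids bad transitions."""
    return all((s, t) not in bad for s in pred for t in action(s))


def weakest_detection_predicate(states: Iterable[Hashable],
                                action: Action,
                                bad: BadTransitions) -> Set[Hashable]:
    """Largest set of states from which the action is safe.

    It equals the union of all detection predicates, which is why it is unique.
    """
    return {s for s in states
            if all((s, t) not in bad for t in action(s))}
```

Any subset of the returned set is again a detection predicate, and the union of two detection predicates is one as well, mirroring the two observations used in the proof.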

Constraints for demonstrating existence of detectors. Now, using the notion of timed guarded commands (cf. Section 2), we precisely define what it means for a fault-tolerant program to contain detectors. In particular, given a timed guarded command, say $g \xrightarrow{\lambda} st$, Lemma 4.9 shows that there is a unique weakest safe predicate, say sf, for it. Hence, to show the existence of detectors, we require that the detection predicate of such an action is $g \wedge sf$. Furthermore, to show that the fault-tolerant program contains the desired detector, we show that it must be using the witness predicate of that detector to ensure that execution of the corresponding action is safe. Towards this end, we define the notion of encapsulation. Intuitively, if p' encapsulates p then for each action of p of the form $g \xrightarrow{\lambda} st$, p' contains an action of the form $g \wedge g' \xrightarrow{\lambda \cup \lambda'} st \| st'$. (See below for the precise definition.) In other words, p' has an action corresponding to each action of p. To show that p' is using a detector for an action of p, we require that the witness predicate of that detector be $g \wedge g'$, which is the guard of the corresponding action in p'.

Based on the above discussion, given a timed guarded command of the form $g \xrightarrow{\lambda} st$ of p, its (weakest) detection predicate sf, and the corresponding action $g \wedge g' \xrightarrow{\lambda \cup \lambda'} st \| st'$ of p', we require that the detection predicate of the desired detector be $g \wedge sf$ and the witness predicate of the desired detector be $g \wedge g'$.


Definition 4.10 (encapsulates) We say that p' encapsulates p from S iff each action in p' that is enabled in a state in S and that updates variables of p is of the form $g \wedge g' \xrightarrow{\lambda \cup \lambda'} st \| st'$, where $g \xrightarrow{\lambda} st$ is an action of p, st' does not update variables of p, and $\lambda'$ does not include clock variables that are reset in p. Furthermore, an action of the form $g \wedge g' \xrightarrow{\lambda \cup \lambda'} st \| st'$ is executed only when its guard, $g \wedge g'$, is true, and to execute this action st and st' are atomically executed. ■


In  order  to  show  the  necessity  of  existence  of  detectors,  we  use  the  definition  of  detection  predicates  and  program  encapsu¬ 
lation.  This  necessity  applies  for  the  case  where  the  fault-tolerant  program  is  obtained  by  reusing  the  fault-intolerant  program. 
With  this  intuition,  we  next  define  what  we  mean  by  a  program  reusing  another  program. 


Definition 4.11 (reuses) We say that p' reuses p from S iff the following two conditions are satisfied:

• p' refines p from S, and

• p' encapsulates p from S. ■

Based on Definitions 4.10 and 4.11, and applying the concept of weak $\delta$-correctors, we are now ready to show that if a program refines a safety specification $SPEC_{bt} \cap SPEC_{br}$ then it contains detectors and weak $\delta$-correctors. The intuition is that if program p' is designed by transforming p so as to satisfy $SSPEC_{bt}$ (respectively, $SSPEC_{br}$), then the transformation must have added a detector (respectively, a weak $\delta$-corrector) for p, and p' reuses p. We formulate this in Claim 4.12.

Claim 4.12 Programs that refine a safety specification contain detectors and weak $\delta$-correctors. That is,

if

• p' reuses p from S, and

• p' refines $SSPEC_{bt} \cap SSPEC_{br}$ from S

then

• ($\forall ac \mid ac$ is an action of p : p' contains a detector of a detection predicate of ac), and

• ($\forall i \mid 1 \le i \le m$ : there exists $\delta \in \mathbb{Z}_{\ge 0}$ such that p' contains a weak $\delta$-corrector for the response predicate of the stable bounded response property $P_i$ $Q_i$ of $SPEC_{br}$).

Proof. We distinguish two cases based on refinement of SPECbt and SPECbr.

1. (existence of detectors) Let ac be an action of p. We show that p' contains a detector of a detection predicate of ac. Let
sf be the weakest detection predicate for ac. Since p' reuses p from S, if ac is of the form g → st, p' contains an action,
say ac', of the form g ∧ g' → st ∥ st'.

Let Z = g ∧ g', and
X = g ∧ sf

Since X ⇒ sf, whenever X is true, execution of ac maintains SSPECbt. It follows that X is a detection predicate of
ac. We now show that p' refines ‘Z detects X’ from S.

• By definition of Z, Z ⇒ g. Since p' refines SSPECbt from S, whenever ac is executed in a state where S is true,
its execution is safe. Since sf is the weakest detection predicate of ac, S ∧ X ⇒ sf. Thus, Safeness is satisfied.

• Consider any computation, say σ̄', of p' which starts in a state where S is true and X is true in each state in σ̄'.
By definition of X, g is true in each state in σ̄'. Now, consider the computation, say σ̄, obtained by projecting σ̄'
on p. Since p' refines p from S, σ̄ is a computation of p. In σ̄, g is continuously true. Therefore, by fairness (cf.
Assumption 4.5), action ac must eventually be executed. Let σ denote the state where action ac executes in σ̄, and
let σ' denote the corresponding state in σ̄'. Consider the action ac' executed by p' in state σ'. Since we assume that
the transitions included in different guarded commands are mutually exclusive, ac' is the only action that can be
executed at σ'. In other words, no other action in p' has the same effect on variables of p from state σ. Thus, Z is
true in state σ' and, hence, Progress is satisfied.

• Suppose Stability does not hold. Then, even if X is continuously true, the computation does not have a suffix
where Z is continuously true. Hence, there exists a computation of p' where the action ac' is never executed.
Now, consider this computation of p' and its projection on p. In the projected computation,
g is continuously true. However, action ac is not executed. This is a contradiction since p' reuses p from S.

2. (existence of weak δ-correctors) As mentioned in Section 2, SPECbr is a conjunction of m ∈ Z≥0 stable bounded
response properties. We show that for each such property of the form Pi ↦≤δi Qi, where 1 ≤ i ≤ m, p' contains a
weak δ-corrector.




Let X = Qi, and

Z = Qi ∧ {σ | σ is a state of p' such that σ is reached in some computation of p' starting from Pi ∧ S}.

Now, we show that there exists δ ∈ Z≥0 such that p' refines ‘Z weakly corrects X within δ’ from S:

• By definition of Z, in any state where Z is true, Qi is true. Thus, Safeness is satisfied.

• Since p' refines Pi ↦≤δi Qi from S, every computation of p' starting from S ∧ Pi will reach a state where S ∧ Qi
is true within δi. By definition of Z, Z is true in this state as well. Thus, Progress is satisfied.

• By the same token and choosing δ = δi, Weak Convergence within δ is satisfied.

• Since p' refines p from S, it follows that S is closed in p'. Obviously, the second conjunct in Z is closed in p' as
well. Thus, Z is closed in p' and, hence, Stability is satisfied. ∎

We now use Claim 4.12 to show that if a hard-failsafe f-tolerant program p' is designed by reusing p, then p' contains
a hard-failsafe tolerant detector for each action of p and a hard-failsafe tolerant weak δ-corrector for each stable bounded
response property.


Theorem 4.13 Hard-failsafe f-tolerant programs contain hard-failsafe tolerant detectors
and weak δ-correctors. That is,

if

• p refines SPEC from S,

• p' reuses p from R, where R ⇒ S, and

• p' is hard-failsafe f-tolerant to SPECbt ∩ SPECbr from R

then

• (∀ac | ac is an action of p : p' is a hard-failsafe f-tolerant detector of a detection predicate of ac), and

• (∀i | 1 ≤ i ≤ m : there exists δ ∈ Z≥0 such that p' contains a hard-failsafe f-tolerant weak δ-corrector for
the response predicate of the stable bounded response property Pi ↦≤δi Qi of SPECbr).


Proof. Similar to Claim 4.12, we distinguish two cases based on refinement of SPECbt and SPECbr.

1. (existence of hard-failsafe tolerant detectors) Let sf be the weakest detection predicate for ac. Since p' encapsulates p
from R, if ac is of the form g → st, p' contains an action, say ac', of the form g ∧ g' → st ∥ st'.

Let Z = g ∧ g', and let
X = g ∧ sf

Since X ⇒ sf, whenever X is true, execution of ac maintains SPECbt. It follows that X is a detection predicate of
ac. We now show that p' is hard-failsafe f-tolerant for ‘Z detects X’ from R. To this end, we first show that p' refines
‘Z detects X’ from R. Then, we show that there exists an f-span T such that p' [] f refines the hard-failsafe tolerance
specification of ‘Z detects X’ from T:

• For the first part, since R ⇒ S, we observe that p' refines SSPECbt from R. Therefore, as in Claim 4.12, it follows
that p' refines ‘Z detects X’ from R.

• For the second part, we let T be the f-span used to show that p' is hard-failsafe f-tolerant for SPECbt from R. In
this part, we need to show that a computation of p' [] f satisfies Safeness and Stability. This proof is identical to the
proof of Safeness and Stability in the first part of Claim 4.12.

2. (existence of hard-failsafe tolerant weak δ-correctors) We show that for each stable bounded response property of the
form Pi ↦≤δi Qi, where 1 ≤ i ≤ m, there exists δ ∈ Z≥0 such that p' contains a hard-failsafe tolerant weak δ-corrector
for Qi.

Let X = Qi, and

Z = Qi ∧ {σ | σ is a state of p' such that σ is reached in some computation of p' starting from Pi}.

Now, we show that there exists δ ∈ Z≥0 such that p' is hard-failsafe f-tolerant for ‘Z weakly corrects X within δ’. To
this end, we first show that p' refines ‘Z weakly corrects X within δ’ from R. Then, we show that there exists an f-span
T such that p' [] f refines the hard-failsafe tolerance specification of ‘Z weakly corrects X within δ’ from T:

• For the first part, since R ⇒ S, we observe that p' refines SSPECbr from R. Therefore, as in Claim 4.12, it follows
that there exists δ ∈ Z≥0 (namely δ = δi) such that p' refines ‘Z weakly corrects X within δ’ from R.

• For the second part, we let T be the f-span used to show that p' is hard-failsafe f-tolerant for SPECbr from R.
In this part, we need to show that a computation of p' [] f satisfies Safeness, Stability, and Weak Convergence. The
proof of Safeness and Stability is identical to the second part of the proof of Claim 4.12. In order to show that a
computation of p' [] f satisfies Weak Convergence, first, recall that by Assumption 2.14, at most k faults may occur.
Hence, p' reaches a state where Qi is true within at most k.δi time units. Thus, by choosing δ' = \fs\.δi + /*, p'
is hard-failsafe f-tolerant to ‘Z weakly corrects X within δ'’ from R. ∎


4.2.1  Example 

We now illustrate an instance of detectors and weak δ-correctors in hard-failsafe programs using an altitude switch example
from [8]. The altitude switch program (Asw) reads a set of input variables coming from an altitude meter. Then, it powers on a
Device of Interest (Doi) when the aircraft descends below a pre-specified altitude threshold. The program has six internal variables,
a mode variable that determines the operating mode of the program, three watchdog timers, and an input variable that represents
the state of the altitude meter:

• The internal variables are as follows: (i) mAltBelow is equal to 1 if the altitude is below a pre-specified threshold;
otherwise, it is equal to 0; (ii) mDOIStatus is equal to 1 if the Doi is powered on; otherwise, it is equal to 0; (iii) mInit
is equal to 1 when the system is being initialized; otherwise, it is equal to 0; (iv) mInhibit is equal to 1 when the Doi
power-on is inhibited; otherwise, it is equal to 0; (v) mReset is equal to 0 if a system reset has been initiated; otherwise,
it is equal to 1; and (vi) iCorruptSensor is equal to 1 if the system reads an invalid value from the altitude meter sensors;
otherwise, it is equal to 0.

• The Asw program can be in three different modes: (i) initialization mode when the system is initializing; (ii) await mode
when the system is waiting for the Doi to power on, and (iii) standby mode. We use the variable mcStatus with domain
{Init, Await, Standby} to show the system modes in the program.

• We model the signals that come from the altitude meter by the variable iSensorReading. This variable is equal to ⊥
when the system fails to read the altitude meter. Otherwise, it is not equal to ⊥.

• We introduce three watchdog timers: (i) the clock variable x measures the time elapsed since the system has been in the
Init mode; (ii) the clock variable y measures the time elapsed since the system has failed to read the altitude meter (i.e.,
iCorruptSensor = 1), and (iii) the clock variable z measures the time elapsed since the system has been in the Await
mode.

Fault-intolerant program. The program Asw changes its mode from Init to Standby within 1 second (Action Asw1 in
the program below). Also, the program may go back to the Init mode if it is in the Standby mode and the reset signal is received
(Action Asw2). If the program is in the Standby mode, the Doi power-on is not inhibited, and the Doi is not powered on,
then the program may go to a state where the altitude of the aircraft is below the threshold and it is in the Await mode (Action
Asw3). In this case, the program starts timer z. Otherwise, the program stays in the Standby mode for an arbitrary time as
long as the altitude meter is not showing an invalid value (Action Asw6). In the Await mode, the program either (i) powers on
the Doi within 2 seconds and goes back to the Standby mode (Action Asw4), or (ii) goes to the Init mode upon receiving
the reset signal (Action Asw5). The program may take delays as long as it does not violate the timing constraints of the
program (Action Asw6). If the program receives a signal indicating that the altitude meter sensors are reading an invalid value,
it sets the variable iCorruptSensor to 1 and starts the timer y (Action Asw8). In this case, the program should go back to the
Init mode within 2 seconds (Action Asw9).


Asw1 :: (mcStatus = Init) ∧ (mInit = 1) ∧ (x < 1)  →  mcStatus, mInit := Standby, 0

Asw2 :: (mcStatus = Standby) ∧ (mReset = 0)  →  mcStatus, mReset := Init, 1

Asw3 :: (mcStatus = Standby) ∧ (mAltBelow = 0) ∧
        (mInhibit = 0) ∧ (mDOIStatus = 0)  →  mcStatus, mAltBelow := Await, 1

Asw4 :: (mcStatus = Await) ∧ (mDOIStatus = 0) ∧ (z < 2)  →  mcStatus, mDOIStatus := Standby, 1

Asw5 :: (mcStatus = Await) ∧ (z < 2)  →  mcStatus, mReset := Init, 1

Asw6 :: ((mcStatus = Standby) ∧ (iCorruptSensor ≠ 1)) ∨
        (x < 1) ∨ (y < 2) ∨ (z < 2)  →  wait

Asw7 :: (mInit = 0)  →  mInit := 1

Asw8 :: (iSensorReading = ⊥) ∧ (iCorruptSensor = 0)  →  iCorruptSensor := 1

Asw9 :: (mcStatus = Standby) ∧ (iCorruptSensor = 1) ∧ (y < 2)  →  mcStatus, iCorruptSensor := Init, 0
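
The guarded commands Asw1 through Asw9 can be prototyped directly. The following is a minimal Python sketch, not the authors' implementation: the `State` record, the `ACTIONS` table, and the helpers `enabled` and `step` are names introduced here for illustration only. The sketch assumes that `wait` leaves the discrete state unchanged (elapse of time is not modeled), that ⊥ is represented by `None`, and that the statements of Asw7 and Asw8 are assignments. Clock comparisons are copied as they appear in the listing, and clock resets are omitted.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class State:
    mcStatus: str = "Init"
    mInit: int = 1
    mReset: int = 1
    mAltBelow: int = 0
    mInhibit: int = 0
    mDOIStatus: int = 0
    iCorruptSensor: int = 0
    iSensorReading: Optional[int] = 0  # None stands for ⊥ (assumption)
    x: float = 0.0  # clocks are plain numbers; time elapse is not modeled
    y: float = 0.0
    z: float = 0.0


# Each entry maps an action name to a (guard, statement) pair.
ACTIONS = {
    "Asw1": (lambda s: s.mcStatus == "Init" and s.mInit == 1 and s.x < 1,
             lambda s: replace(s, mcStatus="Standby", mInit=0)),
    "Asw2": (lambda s: s.mcStatus == "Standby" and s.mReset == 0,
             lambda s: replace(s, mcStatus="Init", mReset=1)),
    "Asw3": (lambda s: s.mcStatus == "Standby" and s.mAltBelow == 0
             and s.mInhibit == 0 and s.mDOIStatus == 0,
             lambda s: replace(s, mcStatus="Await", mAltBelow=1)),
    "Asw4": (lambda s: s.mcStatus == "Await" and s.mDOIStatus == 0 and s.z < 2,
             lambda s: replace(s, mcStatus="Standby", mDOIStatus=1)),
    "Asw5": (lambda s: s.mcStatus == "Await" and s.z < 2,
             lambda s: replace(s, mcStatus="Init", mReset=1)),
    "Asw6": (lambda s: (s.mcStatus == "Standby" and s.iCorruptSensor != 1)
             or s.x < 1 or s.y < 2 or s.z < 2,
             lambda s: s),  # wait: discrete state unchanged in this sketch
    "Asw7": (lambda s: s.mInit == 0,
             lambda s: replace(s, mInit=1)),
    "Asw8": (lambda s: s.iSensorReading is None and s.iCorruptSensor == 0,
             lambda s: replace(s, iCorruptSensor=1)),
    "Asw9": (lambda s: s.mcStatus == "Standby" and s.iCorruptSensor == 1
             and s.y < 2,
             lambda s: replace(s, mcStatus="Init", iCorruptSensor=0)),
}


def enabled(s: State) -> list:
    """Names of all actions whose guard holds in state s."""
    return sorted(name for name, (guard, _) in ACTIONS.items() if guard(s))


def step(s: State, name: str) -> State:
    """Execute one action; refuse to run it when its guard is false."""
    guard, statement = ACTIONS[name]
    if not guard(s):
        raise ValueError(f"{name} is not enabled")
    return statement(s)


if __name__ == "__main__":
    s0 = State()
    s1 = step(s0, "Asw1")
    print(enabled(s0), s1.mcStatus, enabled(s1))
```

Keeping guards and statements as separate functions mirrors the guarded-command notation and makes it straightforward to strengthen a guard later without touching the statement.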
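The invariant of the Asw program consists of the following states:

\[
\begin{aligned}
S_{ASW} = \{\sigma \mid{} & ((\mathit{mcStatus}(\sigma) = \mathit{Init}) \Rightarrow (x(\sigma) < 1)) \;\wedge \\
 & ((\mathit{mcStatus}(\sigma) = \mathit{Standby}) \Rightarrow ((\mathit{iCorruptSensor}(\sigma) = 0) \vee (y(\sigma) < 2))) \;\wedge \\
 & ((\mathit{mcStatus}(\sigma) = \mathit{Await}) \Rightarrow (z(\sigma) < 2))\}.
\end{aligned}
\]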
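
As a quick sanity check, the invariant can be written as a membership test. This is a minimal sketch under stated assumptions: the function name `in_invariant` and its flat argument list are introduced here for illustration, the first conjunct is read as an implication like the other two, and the clock comparisons are copied as they appear above.

```python
def in_invariant(mcStatus: str, iCorruptSensor: int,
                 x: float, y: float, z: float) -> bool:
    """Membership test for the invariant of Asw (one implication per mode)."""
    # An implication A => B is encoded as (not A) or B.
    init_ok = mcStatus != "Init" or x < 1
    standby_ok = mcStatus != "Standby" or iCorruptSensor == 0 or y < 2
    await_ok = mcStatus != "Await" or z < 2
    return init_ok and standby_ok and await_ok


if __name__ == "__main__":
    # A state still initializing in time is inside the invariant.
    print(in_invariant("Init", 0, 0.5, 0.0, 0.0))
    # A state that has overstayed in Init (e.g., after a delay fault) is not.
    print(in_invariant("Init", 0, 1.5, 0.0, 0.0))
```

States that fall outside this predicate are exactly the ones from which a corrector would have to provide recovery.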

The safety specification requires that the program does not change its mode from Standby to Await if it fails to read the
altitude meter. Also, when the state of the program is perturbed, it can only go to the Init mode. Furthermore, the safety
specification has three bounded response properties that specify the timing constraints of the altitude switch. Formally, the
safety specification is as follows:

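\[
\begin{aligned}
SPEC_{bt} = \{(\sigma_0, \sigma_1) \mid{} & ((\mathit{iCorruptSensor}(\sigma_0) = 1) \wedge (\mathit{mcStatus}(\sigma_0) = \mathit{Standby}) \wedge (\mathit{mcStatus}(\sigma_1) = \mathit{Await})) \;\vee \\
 & ((\sigma_0 \notin S_{ASW}) \wedge ((\mathit{mcStatus}(\sigma_1) = \mathit{Standby}) \vee (\mathit{mcStatus}(\sigma_1) = \mathit{Await}))) \;\vee \\
 & ((\sigma_0 \in S_{ASW}) \wedge (\mathit{mReset}(\sigma_1) = 1))\}
\end{aligned}
\]

\[
\begin{aligned}
SPEC_{br_1} ={} & ((\mathit{mcStatus} = \mathit{Init}) \mapsto_{\le 1} (\mathit{mcStatus} = \mathit{Standby})) \;\wedge \\
 & ((\mathit{mcStatus} = \mathit{Await}) \mapsto_{\le 2} ((\mathit{mcStatus} = \mathit{Standby}) \wedge (\mathit{mDOIStatus} = 1)) \vee (\mathit{mcStatus} = \mathit{Init})) \\
SPEC_{br_2} ={} & ((\mathit{mcStatus} = \mathit{Standby}) \wedge (\mathit{iCorruptSensor} = 1)) \mapsto_{\le 2} (\mathit{mcStatus} = \mathit{Init})
\end{aligned}
\]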
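
To make the bounded response properties concrete, the sketch below checks a property of the form P ↦≤δ Q on a finite timed trace. It is an illustration under assumptions, not the paper's formal semantics: the function name `bounded_response_holds`, the trace format (a list of `(timestamp, state)` pairs with nondecreasing timestamps), and the dictionary states are all introduced here. Only the response deadline is checked; the stability part of a stable bounded response property is not modeled, and a deadline that falls beyond the end of the trace is treated as not yet violated.

```python
def bounded_response_holds(trace, p, q, delta):
    """Return False only if some P-state is definitely not followed by a
    Q-state within `delta` time units on this finite trace."""
    if not trace:
        return True
    end_time = trace[-1][0]
    for i, (t_i, s_i) in enumerate(trace):
        if not p(s_i):
            continue
        # Look for a Q-state no later than the deadline t_i + delta.
        met = any(q(s_j) for t_j, s_j in trace[i:] if t_j - t_i <= delta)
        # The deadline was observed only if the trace extends past it.
        deadline_observed = end_time > t_i + delta
        if not met and deadline_observed:
            return False
    return True


def corrupt_standby(s):
    # Left-hand side of the property with deadline 2 on corrupted readings.
    return s["mcStatus"] == "Standby" and s["iCorruptSensor"] == 1


def in_init(s):
    return s["mcStatus"] == "Init"


if __name__ == "__main__":
    on_time = [(0.0, {"mcStatus": "Standby", "iCorruptSensor": 1}),
               (1.5, {"mcStatus": "Init", "iCorruptSensor": 0}),
               (4.0, {"mcStatus": "Init", "iCorruptSensor": 0})]
    too_late = [(0.0, {"mcStatus": "Standby", "iCorruptSensor": 1}),
                (2.5, {"mcStatus": "Standby", "iCorruptSensor": 1}),
                (2.8, {"mcStatus": "Init", "iCorruptSensor": 0})]
    print(bounded_response_holds(on_time, corrupt_standby, in_init, 2))
    print(bounded_response_holds(too_late, corrupt_standby, in_init, 2))
```

The second trace models a delay of the kind that the fault actions introduce: the program does reach Init, but only after the 2-second deadline has passed.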

Fault actions. The Asw program is subject to a set of delay faults when the program is (i) initializing; (ii) in a state where
the altitude meter shows invalid values, or (iii) waiting for the Doi to power on. Moreover, faults can corrupt the reading of the
altitude meter as well.


F1 :: (mcStatus = Init) ∧ (mInit = 1) ∧ (1 < x < 2)  →  wait

F2 :: (mcStatus = Standby) ∧ (iCorruptSensor = 1) ∧ (2 < y < 3)  →  wait

F3 :: (mcStatus = Await) ∧ (2 < z < 3)  →  wait

F4 :: (iSensorReading ≠ ⊥)  →  iSensorReading := ⊥


Hard-failsafe fault-tolerant program with respect to SPECbt and SPECbr2. Observe that in the intolerant program,
action Asw3 does not maintain the safety specification when the altitude meter shows an invalid value (i.e., fault F4 occurs).
To preserve the safety specification, we use a detector DR with detection predicate X =
(mcStatus = Standby) ∧ (iCorruptSensor ≠ 1) and witness predicate Z = Guard(Asw3) ∧ (iCorruptSensor ≠ 1).
To add hard-failsafe fault-tolerance, Asw3 is restricted to execute only when the witness predicate of DR is true. Thus,
we have

Hfs_Asw3 :: (mcStatus = Standby) ∧ (mAltBelow = 0) ∧ (iCorruptSensor ≠ 1) ∧
            (mInhibit = 0) ∧ (mDOIStatus = 0)  -{z}→  mcStatus, mAltBelow := Await, 1
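
The effect of the detector can be checked by brute force over the discrete variables. The sketch below is an illustration under assumptions, not the authors' construction: all function names are introduced here, clocks are ignored, and only the first disjunct of SPECbt (a move from Standby to Await while iCorruptSensor = 1) is tested. It confirms that the witness predicate Z implies the detection predicate X, and that guarding Asw3 with Z removes the offending transition.

```python
from itertools import product

MODES = ("Init", "Standby", "Await")


def states():
    """Enumerate all valuations of the discrete variables used by Asw3."""
    for mode, alt, inh, doi, cor in product(MODES, (0, 1), (0, 1), (0, 1), (0, 1)):
        yield {"mcStatus": mode, "mAltBelow": alt, "mInhibit": inh,
               "mDOIStatus": doi, "iCorruptSensor": cor}


def guard_asw3(s):
    return (s["mcStatus"] == "Standby" and s["mAltBelow"] == 0
            and s["mInhibit"] == 0 and s["mDOIStatus"] == 0)


def detection_x(s):
    # Detection predicate X of detector DR.
    return s["mcStatus"] == "Standby" and s["iCorruptSensor"] != 1


def witness_z(s):
    # Witness predicate Z of detector DR; also the guard of Hfs_Asw3.
    return guard_asw3(s) and s["iCorruptSensor"] != 1


def stmt_asw3(s):
    return {**s, "mcStatus": "Await", "mAltBelow": 1}


def bad_first_disjunct(s0, s1):
    # First disjunct of SPECbt: Standby -> Await while the sensor is corrupt.
    return (s0["iCorruptSensor"] == 1 and s0["mcStatus"] == "Standby"
            and s1["mcStatus"] == "Await")


def bad_source_states(guard):
    """States where the guard holds and the Asw3 statement is unsafe."""
    return [s for s in states()
            if guard(s) and bad_first_disjunct(s, stmt_asw3(s))]


if __name__ == "__main__":
    print(len(bad_source_states(guard_asw3)))  # unsafe states for Asw3
    print(len(bad_source_states(witness_z)))   # unsafe states for Hfs_Asw3
    print(all(detection_x(s) for s in states() if witness_z(s)))  # Z implies X
```

The check covers only the Safeness-style obligation on discrete states; Progress and Stability concern computations and are not exercised by this enumeration.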

Moreover, we require that if the program fails to read the altitude meter, it should initialize in at most 2 seconds even in the
presence of the delay fault F2 :: (mcStatus = Standby) ∧ (iCorruptSensor = 1) ∧ (2 < y < 3) → wait. To preserve
this property, we use a weak δ-corrector. In particular, actions Asw6 and Asw9 can potentially be such a corrector. Thus,
it suffices to strengthen action Asw6 to get a weak 1-corrector that tolerates F2 by allowing delay transitions only for at most
1 second when the program is in the Standby mode and the altitude meter reads a corrupted value:

Hfs_Asw6 :: ((mcStatus = Standby) ∧ (iCorruptSensor ≠ 1)) ∨
            (x < 1) ∨ (y < 1) ∨ (z < 2)  →  wait
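
The difference between the two wait guards can be seen on a single state. The sketch below is illustrative only: the function names and the sample clock values are assumptions introduced here, and the comparisons are copied as they appear in the guards of Asw6 and Hfs_Asw6.

```python
def asw6_wait_enabled(mcStatus, iCorruptSensor, x, y, z):
    """Guard of Asw6 in the intolerant program."""
    return ((mcStatus == "Standby" and iCorruptSensor != 1)
            or x < 1 or y < 2 or z < 2)


def hfs_asw6_wait_enabled(mcStatus, iCorruptSensor, x, y, z):
    """Guard of Hfs_Asw6: the bound on y is tightened from 2 to 1."""
    return ((mcStatus == "Standby" and iCorruptSensor != 1)
            or x < 1 or y < 1 or z < 2)


if __name__ == "__main__":
    # Standby mode, corrupted reading, 1.5 time units on clock y;
    # x and z are chosen large so that only the y disjunct matters.
    sample = ("Standby", 1, 5.0, 1.5, 5.0)
    print(asw6_wait_enabled(*sample))      # delay still allowed
    print(hfs_asw6_wait_enabled(*sample))  # delay no longer allowed
```

Because the guard is a disjunction, the tighter bound on y takes effect only when the disjuncts on x and z are false, which is why the sample state sets both clocks beyond their bounds.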

Observe that Actions Asw6 and Asw9 in the intolerant program already provide the required computations to meet the
deadline of reaching the Standby mode in the fault-span. Thus, in the design of the weak δ-corrector, we did not add any
“recovery” computations to the tolerant program. In fact, in order to design the weak δ-corrector, we removed the computations
that do not meet the deadline in the presence of faults by strengthening Action Asw6.

4.3  The  Role  of  Real-Time  Fault-Tolerance  Components  in  Soft  and  Hard-Masking  Programs 

In this subsection, we present the theory of necessity of real-time fault-tolerance components in masking tolerant programs.
We distinguish three levels of masking fault-tolerance, namely, (1) soft-masking with bounded-time recovery, (2) hard-masking
with bounded-time recovery, and (3) hard-masking with unbounded-time recovery. Recall that necessity of existence of
components in soft-masking tolerance with unbounded-time recovery has been shown in [1]. For reasons of space and
similarity of proofs to previous sections, in this subsection, we only outline the main theorems and we refer the reader to
http://www.cse.msu.edu/~borzoo/proofs.pdf for the detailed proofs.

Soft-masking with bounded-time recovery. Since a soft-masking program must maintain SPECbt in the presence of faults
and it guarantees bounded-time recovery, we can decompose it into a fault-intolerant program, detectors to refine SPECbt, and
strong δ-correctors to provide bounded-time recovery.


Theorem 4.14 Soft-masking f-tolerant programs with recovery time θ contain soft-masking
tolerant detectors and strong δ-correctors. That is,

if

• p refines SPEC from S,

• p' reuses p from R, where R ⇒ S,

•  p 1  refines  SPEC bt<ep\R from  T,  where  T  <=  R,  and 

• p' is soft-masking f-tolerant to SPEC from T,

then

• (∀ac | ac is an action of p : p' is a soft-masking f-tolerant detector of a detection predicate of ac),

• there exists δ such that p' is a soft-masking tolerant strong δ-corrector of an invariant predicate of p, and

• p' is a nonmasking f-tolerant δ-corrector with recovery time θ' of an invariant predicate of p for some θ'. ∎


Hard-masking with bounded-time recovery. Since a hard-masking program must maintain both SPECbt and SPECbr in
the presence of faults and it guarantees bounded-time recovery, we can decompose it into a fault-intolerant program, detectors
for refining SPECbt, weak δ-correctors for refining SPECbr, and strong δ-correctors to provide bounded-time recovery.

Theorem 4.15 Hard-masking f-tolerant programs with recovery time θ contain hard-masking
tolerant detectors, weak and strong δ-correctors. That is,

if

• p refines SPEC from S,

• p' reuses p from R, where R ⇒ S,

• p' becomes p|R within θ from T, where T ⇐ R, and

• p' is hard-masking f-tolerant to SPEC from T,

then

• (∀ac | ac is an action of p : p' is a hard-masking f-tolerant detector of a detection predicate of ac),

• (∀i | 0 < i ≤ m : there exists δ ∈ Z≥0 such that p' contains a hard-masking f-tolerant weak δ-corrector
for the response predicate of the stable bounded response property Pi ↦≤δi Qi of SPECbr),

• there exists δ' such that p' is a hard-masking tolerant strong δ'-corrector of an invariant predicate of p, and

• p' is a nonmasking f-tolerant δ'-corrector with recovery time θ' of an invariant predicate of p for some θ'. ∎


Observe that in the case of unbounded-time soft-masking (respectively, hard-masking) programs, in Theorem 4.14 (respectively,
Theorem 4.15), we only need to replace the strong δ-correctors (respectively, strong δ'-correctors) by strong ∞-correctors. The
rest of the statement of both theorems and the proofs remain unchanged.

4.4  Example 

We now illustrate the role of strong δ-correctors in hard-masking fault-tolerance in the Asw program presented in Subsection
4.2.1. More specifically, we enhance the fault-tolerance level of Asw from hard-failsafe to hard-masking with recovery time θ = 1.
Using the hard-failsafe program in Subsection 4.2.1, it only remains to add a strong δ-corrector that provides recovery to the
invariant. Notice that according to SPECbt of Asw, recovery is only possible to the states where the program is in the Init
mode, which (similar to the alternating bit protocol) in turn identifies the witness predicate. Thus, the actions of the strong
δ-corrector are as follows:


CR1 :: (1 < x < 2) ∨ (2 < z < 3) ∨
       ((2 < y < 3) ∧ (iCorruptSensor = 1))  →  mcStatus, mReset := Init, 1

CR2 :: (1 < x < 2) ∨ (2 < z < 3) ∨
       ((2 < y < 3) ∧ (iCorruptSensor = 1))  →  wait

5 Conclusion and Future Work


In this paper, we focused on a theory of fault-tolerance components in the context of real-time programs. We showed that these
components, namely, detectors and weak/strong δ-correctors, are integral parts of fault-tolerant real-time programs that reuse
their fault-intolerant version. In particular, we showed that depending upon the level of fault-tolerance, namely soft-failsafe,
hard-failsafe, nonmasking, soft-masking, and hard-masking, the fault-tolerant real-time program contains one or more of such
components.

We note that the use of detectors and correctors from [1] in untimed programs has been illustrated in terms of several
examples such as repetitive Byzantine agreement, mutual exclusion, tree maintenance, leader election, termination detection,
bounded network management, etc. We are currently investigating the use of detectors and weak/strong δ-correctors for
developing design methods for fault-tolerant real-time programs.

We are also focusing on using these components in the context of automated addition of fault-tolerance. In particular, by
designing and verifying these components in advance, it would be possible to speed up automated addition of fault-tolerance to
real-time programs. Towards this end, we are working on extending our current work on automated addition of fault-tolerance
(cf. http://www.cse.msu.edu/~sandeep/publications) to use detectors and weak/strong δ-correctors.